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Professor  Andrea  L.  Bertozzi,  Chair 

The  demand  for  analyzing  patterns  and  structures  of  data  is  growing  dramatically 
in  recent  years.  The  study  of  network  structure  is  pervasive  in  sociology,  biology, 
computer  science,  and  many  other  disciplines.  My  research  focuses  on  network 
and  high-dimensional  data  analysis,  using  graph  based  models  and  tools  from 
sparse  optimization.  The  specific  question  about  networks  we  are  studying  is 
“clustering”:  partitioning  a  network  into  cohesive  groups.  Depending  on  the 
contexts,  these  tightly  connected  groups  can  be  social  units,  functional  modules, 
or  components  of  an  image. 

My  work  consists  of  both  theoretical  analysis  and  numerical  simulation.  We 
first  analyze  some  social  network  and  image  datasets  using  a  quality  function 
called  “modularity” ,  which  is  a  popular  model  for  clustering  in  network  science. 
Then  we  further  study  the  modularity  function  from  a  novel  perspective:  with  my 
collaborators  we  reformulate  modularity  optimization  as  a  minimization  problem 
of  an  energy  functional  that  consists  of  a  total  variation  term  and  an  L2  balance 
term.  By  employing  numerical  techniques  from  image  processing  and  L\  com¬ 
pressive  sensing,  such  as  the  Merriman-Bence-Osher  (MBO)  scheme,  we  develop 
a  variational  algorithm  for  the  minimization  problem. 


Along  a  similar  line  of  research,  we  work  on  a  multi-class  segmentation  prob¬ 
lem  using  the  piecewise  constant  Mumford-Shah  model  in  a  graph  setting.  We 
propose  an  efficient  algorithm  for  the  graph  version  of  Mumford-Shah  model  using 
the  MBO  scheme.  Theoretical  analysis  is  developed  and  a  Lyapunov  functional 
is  proven  to  decrease  as  the  algorithm  proceeds.  Furthermore,  to  reduce  the  com¬ 
putational  cost  for  large  datasets,  we  incorporate  the  Nystrom  extension  method 
to  efficiently  approximates  eigenvectors  of  the  graph  Laplacian  based  on  a  small 
portion  of  the  weight  matrix.  Finally,  we  implement  the  proposed  method  on  the 
problem  of  chemical  plume  detection  in  hyper-spectral  video  data.  These  graph 
based  clustering  algorithms  we  proposed  improve  the  time  efficiency  significantly 
for  large  scale  datasets.  In  the  last  chapter,  we  also  propose  an  incremental  reseed¬ 
ing  strategy  for  clustering,  which  is  an  easy-to-implement  and  highly  parallelizable 
algorithm  for  multiway  graph  partitioning.  We  demonstrate  experimentally  that 
this  algorithm  achieves  state-of-the-art  performance  in  terms  of  cluster  purity  on 
standard  benchmark  datasets.  Moreover,  the  algorithm  runs  an  order  of  magni¬ 
tude  faster  than  the  other  algorithms. 
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CHAPTER  1 


Introduction 

One  of  the  most  basic  unsupervised  learning  tasks  is  to  automatically  partition 
data  into  clusters  based  on  pair-wise  similarity.  A  standard  scenario  is  to  use  a 
weighted  graph  to  represent  the  data.  This  thesis  particularly  focuses  on  unsu¬ 
pervised,  non-overlapping  data  clustering  tasks. 

Data  clustering  has  been  a  widely  studied  problem  in  various  fields.  It  takes 
different  forms  depending  on  the  context  and  application,  such  as  community 
detection  in  network  science,  segmentation  problem  in  image  processing,  graph 
partitioning  in  parallel  computing.  My  work  starts  from  a  network  science  point 
of  view  using  graph  models.  Pursuing  a  similar  line  of  research,  we  extend  the 
developed  graph  tools  to  a  multi-class  segmentation  model  in  the  context  of  hyper- 
spectral  imagery  segmentation  and  general  high-dimensional  data  clustering.  At 
last,  we  also  study  an  incremental  reseeding  strategy  for  clustering  that  produces 
high  performance  with  very  low  computational  cost. 

1.1  Network  Community  Detection 

Networks  provide  a  useful  representation  for  the  investigation  of  complex  systems, 
and  they  have  accordingly  attracted  considerable  attention  in  sociology,  biology, 
computer  science,  and  many  other  disciplines  [70,71].  Most  of  the  networks  that 
people  study  are  graphs,  which  consist  of  nodes  (i.e.,  vertices)  to  represent  the 
elementary  units  of  a  system  and  edges  to  represent  pairwise  connections  or  in- 
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teractions  between  the  nodes. 


Using  networks  makes  it  possible  to  examine  intermediate-scale  structure  in 
complex  systems.  Most  investigations  of  intermediate-scale  structures  have  fo¬ 
cused  on  community  structure ,  in  which  one  decomposes  a  network  into  (possibly 
overlapping)  cohesive  groups  of  nodes  called  communities  [76].  There  is  a  higher 
density  of  connections  within  communities  than  between  them. 

In  some  applications,  communities  have  been  related  to  functional  units  in 
networks  [76].  For  example,  a  community  might  be  closely  related  to  a  functional 
module  in  a  biological  system  [55]  or  a  group  of  friends  in  a  social  system  [88]. 
Because  community  structure  in  real  networks  can  be  very  insightful  [33,37,71,76], 
it  is  useful  to  study  algorithmic  methods  to  detect  communities.  Such  efforts 
have  been  useful  in  studies  of  the  social  organization  in  friendship  networks  [88], 
legislation  cosponsorships  in  the  United  States  Congress  [100],  functional  modules 
in  biology  networks  [39,55],  and  many  other  situations. 

To  perform  community  detection,  one  needs  a  quantitative  definition  for  what 
constitutes  a  community,  though  this  relies  on  the  goal  and  application  that  one 
has  in  mind.  Perhaps  the  most  popular  approach  is  to  optimize  a  quality  function 
known  as  modularity  [68,72,73],  and  numerous  computational  heuristics  have  been 
developed  for  optimizing  modularity  [33,76].  The  modularity  of  a  network  parti¬ 
tion  measures  the  fraction  of  total  edge  weight  within  communities  versus  what 
one  might  expect  if  edges  were  placed  randomly  according  to  some  null  model. 
Modularity  gives  one  definition  of  the  “quality”  of  a  partition,  and  maximizing 
modularity  is  supposed  to  yield  a  reasonable  partitioning  of  a  network  into  disjoint 
communities. 

Community  detection  is  related  to  graph  partitioning,  which  has  been  applied 
to  problems  in  numerous  areas  [74,81,95].  In  graph  partitioning,  a  network  is  di¬ 
vided  into  disjoint  sets  of  nodes.  Graph  partitioning  usually  requires  the  number 
of  clusters  to  be  specified  to  avoid  trivial  solutions,  whereas  modularity  optimiza- 
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tion  does  not  require  one  to  specify  the  number  of  clusters  [76].  This  is  a  desirable 
feature  for  applications  such  as  social  and  biological  networks. 

The  idea  of  modularity  is  generalized  to  a  “multi-slice  network”  version  in  [64] , 
which  consists  of  layers  of  ordinary  networks  in  which  vertex  x  in  one  slice  is  con¬ 
nected  to  the  corresponding  vertex  in  other  slices,  via  a  coupling  constant.  Each 
layer  can  represent  relationships  induced  from  a  different  type  of  feature  of  the 
agents,  the  status  of  the  network  at  a  different  temporal  point,  or  the  network 
inspected  under  a  different  scale.  The  multi-slice  modularity  model  allows  us  to 
cluster  multiple  layers  of  a  network  simultaneously,  while  enforcing  some  consis¬ 
tency  in  clustering  identical  vertices  similarly  across  slices.  In  the  work  of  [47,91], 
we  implement  the  multi-slice  modularity  on  a  dataset  with  both  geographic  and 
social  information  about  stops  involving  street  gang  members  in  the  Los  Angeles 
Police  Department  (LAPD)  Division  of  Hollenbeck  [92],  Without  prior  knowledge 
of  the  number  of  gangs  of  the  members,  we  examined  network  diagnostics  over 
slices  to  attempt  to  estimate  the  number  of  gangs  that  is  stable  across  multiple 
scales,  which  turns  out  to  correspond  roughly  to  the  number  expected  by  the 
LAPD.  We  also  applied  this  technique  to  image  segmentation  and  tried  to  deter¬ 
mine  the  number  of  components  in  a  test  image  through  the  multi-slice  network 
model  [47]. 

Because  modularity  optimization  is  an  NP-hard  problem  [12],  efficient  algo¬ 
rithms  are  necessary  to  find  good  locally  optimal  network  partitions  with  rea¬ 
sonable  computational  costs.  Numerous  methods  have  been  proposed  [33,76]. 
These  include  greedy  algorithms  [23,67],  extremal  optimization  [11,28],  simulated 
annealing  [40,51],  spectral  methods  (which  use  eigenvectors  of  a  modularity  ma¬ 
trix)  [68,  79],  and  more.  The  locally  greedy  algorithm  by  Blondel  et  al.  [9]  is 
arguably  the  most  popular  computational  heuristic;  it  is  a  very  fast  algorithm, 
and  it  also  yields  high  modularity  values  [33,53]. 

In  [45],  we  interpret  modularity  optimization  (using  the  Newman-Girvan  null 
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model  [71,  73])  from  a  novel  perspective.  Inspired  by  the  connection  between 
graph  cuts  and  the  total  variation  (TV)  of  a  graph  partition,  we  reformulate  the 
problem  of  modularity  optimization  as  a  minimization  of  an  energy  functional  that 
consists  of  a  graph  cut  (i.e. ,  TV)  term  and  an  L2  balance  term.  By  employing 
numerical  techniques  from  image  processing  and  L\  compressive  sensing — such  as 
convex  splitting  and  the  Merriman-Bence-Osher  (MBO)  scheme  [62,63]we  propose 
a  variational  algorithm  to  perform  the  minimization  on  the  new  formula.  (The 
MBO  was  originally  introduced  to  approximate  motion  by  mean  curvature  of  an 
interface  in  Euclidean  space.)  We  apply  this  method  to  both  synthetic  benchmark 
networks  and  real  data  sets,  and  we  achieve  performance  that  is  competitive  with 
the  state-of-the-art  modularity  optimization  algorithms. 

1.2  Hyper-spectral  Image  Segmentation 

Multi-class  segmentation  has  been  studied  as  an  important  problem  for  many  years 
in  various  areas,  such  as  computer  science  and  machine  learning.  For  imagery  data 
in  particular,  the  Mumford-Shah  model  [65]  is  one  of  the  most  extensively  used 
model  in  the  past  decade.  This  model  approximates  the  true  image  by  an  optimal 
piecewise  smooth  function  through  solving  a  energy  minimization  problem.  More 
detailed  review  of  the  work  on  Mumford-Shah  model  can  be  found  in  the  references 
of  [20].  A  simplified  version  of  Mumford-Shah  is  the  piecewise  constant  model 
(also  known  as  the  “minimal  partition  problem”),  which  is  widely  used  due  to  its 
reduced  complexity  compared  to  the  original  one.  For  a  given  contour  <f>  which 
segments  an  image  region  12  into  n  many  disjoint  sub-regions  12  =  U”=112r,  the 
piecewise  constant  Mumford-Shah  energy  is  defined  as: 

n  » 

FMS($,  {cr}tl)  =  \$\  +  \J2  («0  -  cr)2  ,  (1.1) 

r=  1  J 

where  uq  is  the  observed  image  data,  {cr}"=1  is  a  set  of  constant  values,  and 
|$|  denotes  the  length  of  the  contour  <f>.  By  minimizing  the  energy  EMS  over 
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$  and  {cJJU,  one  obtains  an  optimal  function  which  is  constant  within  each 
sub-region  to  approximate  uq,  along  with  a  segmentation  given  by  the  optimal 
<f>.  In  [21],  a  method  of  active  contours  without  edges  is  proposed  to  solve  for 
the  two-class  piecewise  constant  Mumford-Shah  model  (h  =  2),  using  a  level  set 
method  introduced  in  [75].  The  work  in  [21]  is  further  generalized  to  a  multi-class 
scenario  in  [93].  The  method  developed  in  [21,93]  is  well  known  as  the  Chan-Vese 
model,  which  is  a  popular  and  representative  method  for  image  segmentation. 
The  Chan-Vese  method  has  been  widely  used  due  to  the  model’s  flexibility  and 
the  great  success  it  achieves  in  performance. 

In  the  work  of  [46],  we  study  the  multi-class  segmentation  problem  using 
the  Mumford-Shah  model  in  the  context  of  imagery  or  general  hight-dimensional 
datasets,  in  particular  hyper-spectral  data.  It  can  be  viewed  as  a  clustering  prob¬ 
lem  with  Mumford-Shah  model  being  the  quality  function  (in  contrast  to  modular¬ 
ity).  We  formulate  the  piecewise  constant  MS  problem  in  a  graph  setting  instead 
of  a  continuous  one,  and  propose  an  efficient  algorithm  to  solve  it.  Recently  the 
authors  of  [8]  introduced  a  binary  semi-supervised  segmentation  method  based 
on  minimizing  the  Ginzburg-Landau  functional  on  a  graph.  Inspired  by  [8],  a 
collection  of  work  has  been  done  on  graph-based  high-dimensional  data  clustering 
problems  posed  as  energy  minimization  problems,  such  as  semi-supervised  meth¬ 
ods  studied  in  [35,  59,  60]  and  an  unsupervised  network  clustering  method  [45] 
known  as  modularity  optimization.  These  methods  make  use  of  graph  tools  [22] 
and  efficient  graph  algorithms,  and  our  work  pursues  similar  ideas.  Note  that  un¬ 
like  the  Chan-Vese  model  which  uses  logz^h)  many  level  set  functions  and  binary 
representations  to  denote  multiple  classes,  our  model  uses  simplex  constrained 
vectors  for  class  assignments  representation.  To  solve  the  multi-class  piecewise 
constant  MS  variational  problem  in  the  graph  setting,  we  propose  an  efficient  al¬ 
gorithm  adopting  the  idea  of  the  MBO  scheme  [62,63].  The  MBO  scheme  is  used 
on  the  continuous  MS  model  [29,85]  motivated  by  level  set  methods.  The  authors 
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of  [35,45,59,60]  implement  variants  of  the  MBO  scheme  applied  to  segmentation 
problems  in  a  graph  setting.  Rigorous  proofs  of  convergence  of  the  original  MBO 
scheme  in  continuous  setting  can  be  found  in  [5, 31]  for  the  binary  case,  and  [30] 
for  the  multi-class  case.  An  analogous  discussion  in  a  graph  setting  is  given  in  [90]. 
Inspired  by  the  work  of  [30,90],  we  develop  a  Lyapunov  functional  for  our  pro¬ 
posed  variant  of  the  MBO  algorithm,  which  approximates  the  graph  MS  energy. 
Theoretical  analysis  is  given  to  prove  that  this  Lyapunov  energy  decreases  at  each 
iteration  of  our  algorithm,  until  it  converges  within  finitely  many  steps. 

In  order  to  solve  for  each  iteration  of  the  MBO  scheme,  one  needs  to  compute 
the  weight  matrix  of  the  graph  as  well  as  the  eigenvectors  of  the  corresponding 
graph  Laplacian.  However,  the  computational  cost  can  become  prohibitive  for 
large  datasets.  To  reduce  the  numerical  expenses,  we  implement  the  Nystrom 
extension  method  [34]  to  approximately  compute  the  eigenvectors,  which  only 
requires  computing  a  small  portion  of  the  weigh  matrix.  Thus  the  proposed  al¬ 
gorithm  is  efficient  even  for  large  datasets,  such  as  the  hyper-spectral  video  data 
considered  in  [46]. 

The  proposed  method  can  be  implemented  on  general  high-dimensional  data 
clustering  problems.  However,  in  this  work  the  numerical  experiment  is  focused  on 
the  detection  of  chemical  plumes  in  hyper-spectral  video  data.  Detecting  harm¬ 
ful  gases  and  chemical  plumes  has  wide  applicability,  such  as  in  environmental 
study,  defense  and  national  security.  However,  the  diffusive  nature  of  plumes 
poses  challenges  and  difficulties  for  the  problem.  One  popular  approach  is  to  take 
advantage  of  hyper-spectral  data,  which  provides  much  richer  sensing  information 
than  ordinary  visual  images.  The  hyper-spectral  images  used  in  this  work  were 
taken  from  video  sequences  captured  by  long  wave  infrared  (LWIR)  spectrome¬ 
ters  at  a  scene  where  a  collection  of  plume  clouds  is  released.  Over  100  spectral 
channels  at  each  pixel  of  the  scene  are  recorded,  where  each  channel  corresponds 
to  a  particular  frequency  in  the  spectrum  ranging  from  7,820  nm  to  11,700  nm. 
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The  data  is  provided  by  the  Applied  Physics  Laboratory  at  Johns  Hopkins  Uni¬ 
versity,  (see  more  details  in  [18]).  Prior  analysis  of  this  dataset  can  be  found  in 
the  works  [36,61,84,86].  The  authors  of  [61]  implement  a  semi-supervised  graph 
model  using  a  similar  MBO  scheme.  In  this  work,  each  pixel  is  considered  as  a 
node  in  a  graph,  upon  which  the  proposed  unsupervised  segmentation  algorithm 
is  implemented.  Competitive  results  are  achieved  as  demonstrated  below. 

1.3  An  Incremental  Reseeding  Strategy 

We  propose  an  easy-to-implcment  and  highly  parallclizable  algorithm  for  multiway 
graph  partitioning.  Different  from  the  methods  studied  in  the  previous  chapters,  it 
does  not  have  an  explicit  global  quality  function.  Instead,  this  heuristic  strategy 
takes  advantage  of  random  sampling  in  order  to  reduce  the  chances  that  the 
solutions  get  stuck  at  local  minima. 

The  algorithm  proceeds  by  alternating  three  simple  routines  in  an  iterative 
fashion:  diffusion,  thresholding,  and  random  sampling.  We  demonstrate  experi¬ 
mentally  that  the  proper  combination  of  these  ingredients  leads  to  an  algorithm 
that  achieves  state-of-the-art  performance  in  terms  of  cluster  purity  on  standard 
benchmark  datasets.  Moreover,  the  algorithm  runs  an  order  of  magnitude  faster 
than  the  other  algorithms  that  achieve  comparable  results  in  terms  of  accuracy. 

1.4  Outlines 

The  rest  of  this  thesis  is  organized  as  follows.  Chapter  2  introduces  the  background 
and  preliminaries  of  the  graph  setting,  the  definition  and  properties  of  the  graph 
Laplaeians.  A  few  representative  quality  functions  such  as  the  graph  cuts  and 
the  modularity  are  presented  for  the  general  clustering  problem.  In  Chapter  3,  a 
specific  method  called  the  multi-slice  modularity  optimization  is  implemented  on 
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social  and  imagery  data  to  study  the  network  structure,  and  its  performance  is 
evaluated  with  statistical  network  diagnostics. 

In  Chapter  4,  the  techniques  of  variational  methods  in  partial  differential  equa¬ 
tion  and  image  processing  are  discussed,  so  that  an  alternative  approach  to  the 
graph-based  clustering  problem  can  be  developed.  More  specifically,  the  Ginzburg- 
Landau  functional  and  the  MBO  scheme  are  introduced  in  Euclidean  space.  Then 
we  derive  an  equivalent  formula  of  modularity  optimization  as  a  minimization 
problem  of  an  energy  functional  that  consists  of  a  total  variation  term  and  an 
L2  balance  term.  A  graph  version  of  the  MBO  scheme  is  presented  for  solving 
the  minimization  problem;  and  numerical  tests  with  our  algorithms  on  several 
benchmark  and  real-world  networks  are  demonstrated. 

Chapter  5  introduces  the  graph  formula  for  the  multi-class  piecewise  constant 
Mumford-Shah  model  and  relevant  notations.  A  Mumford-Shah  MBO  scheme  is 
presented  as  well  as  the  theoretical  analysis  for  a  Lyapunov  functional  which  is 
proven  to  decrease  as  the  algorithm  proceeds;  techniques  such  as  Nystrom  method 
are  also  introduced  for  the  purpose  of  numerical  efficiency.  At  last,  the  proposed 
algorithm  is  tested  on  the  hyper-spectral  video  data  for  plume  detection  problem. 
The  results  are  then  presented  and  discussed. 

Chapter  6  introduces  an  incremental  reseeding  strategy  for  clustering.  Com¬ 
parisons  on  computational  time  and  performance  between  the  proposed  algorithm 
and  several  state-of-art  methods  are  presented. 


CHAPTER  2 


Preliminary 


In  this  chapter  we  firstly  introduce  the  basic  notations  for  the  graph  model  and 
some  useful  properties  of  spectral  graph  theory  [22,95].  Then  we  briefly  present 
several  commonly  used  clustering  models  based  on  graphs.  The  advantage  of 
graph-based  methods  is  that  it  can  reduce  the  high-dimensionality  of  the  feature 
space  via  a  similarity  metric;  It  can  also  reflect  non-local  properties  of  the  data. 

2.1  Graph  Basics 

A  graph  is  a  commonly  used  mathematical  tool  to  represent  a  network  and  rela¬ 
tionships  between  data  samples,  as  illustrated  in  Figure  2.1.  Consider  an  IV-node 
graph  (G,  E),  which  consists  of  a  node  set  G  =  {ni, ni , ,un}  and  an  edge  set 
E  =  {wij}f=l.  Each  node  nx  corresponds  to  an  agent  in  a  given  dataset,  such 
as  a  pixel  in  an  image,  or  a  person  in  a  social  network.  In  a  weighted  graph, 
each  edge  is  assigned  with  a  non- negative  quantity  wtj ,  representing  the  similarity 
between  a  pair  of  nodes  nt  and  rtj.  If  an  edge  is  not  included  in  the  edge  set  E ,  it 
is  assumed  to  be  equivalent  to  having  zero  similarity.  Let  W  =  [vjtJ]  denote  the 
graph’s  N  x  N  similarity  matrix  (or  weight  matrix). 

In  this  work  we  only  focus  on  undirected  graph,  i.e.  W  is  symmetric,  Wij  =  Wji. 
Although  the  directed  graph  (i.e.  ^  )  is  also  studied  in  the  literature  for 

relationships  that  are  not  mutual,  (such  as  one  user  “following”  another  on  Twitter 
but  not  the  other  way  around),  it  is  beyond  the  realm  of  this  work.  If  not  specified 
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Figure  2.1:  Illustration  of  an  undirected  weighted  graph. 

otherwise,  the  similarity  weight  Wij  is  always  nonnegative.  For  unweighted  graph, 
simply  assign  with  value  of  either  one  or  zero. 

2.1.1  Feature  vector  Sc  similarity  metric 

In  order  to  build  a  graph  that  represents  the  relationships  between  agents  in 
the  network,  one  needs  a  similarity  metric  to  compute  the  weight  w^,  i.e.  how 
to  quantify  the  similarity  between  node  n*  and  node  rij.  In  a  dataset,  usually 
each  agent  (node)  comes  with  a  so-called  “feature  vector”  which  contains  feature 
information  relevant  to  the  problem.  In  a  social  network,  a  feature  vector  can 
contain  entries  about  age,  gender,  occupation,  etc.  Note  that  each  entry  of  the 
feature  vector  should  be  normalized  accordingly,  for  example,  one  person’s  age 
and  salary  have  quite  different  ranges  of  values  and  it  is  reasonable  to  normalize 
them  to  a  similar  range.  A  descriptive  feature  such  as  gender  can  be  treated  as  a 
binary  value  in  the  feature  vector. 

Let  vector  Vi  denotes  the  feature  vector  of  node  nl .  A  similarity  metric  is  a 
mapping  from  ( Vi,Vj )  to  w^.  A  common  choice  for  the  similarity  metric  is 

dist^  (v^  ,Vj  ) 

Wij  =  e  2^2 

where  dist(uj,  v3)  can  be  any  suitable  metric  distance  between  Vi  and  Vj  in  Eu¬ 
clidean  space,  such  as  Li  or  L2  norm,  or  the  angle  between  the  two  vectors.  The 
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choice  of  the  similarity  metric  highly  depends  on  the  data  and  the  application. 

Note  that  if  the  dimension  of  feature  vectors  is  too  high,  some  algorithms  may 
not  scale  well.  In  that  case,  one  can  reduce  the  dimensionality  using  principal 
component  analysis  (PCA)  [49].  A  basic  PCA  dimension  reduction  process  can 
be  performed  as  following. 

Given  a  set  of  data  samples  with  high-dimensional  d  x  1  features  vectors 
Xi,  x2,  ■  ■  ■ ,  x jv  €  Md,  we  want  to  reduce  the  dimension  from  d  to  d: 

1.  Let  yU  denote  the  empirical  mean  of  the  feature  vectors:  =  jjYliLi  xi- 

Build  a,  N  x  d  matrix 

X  =  (xi  —  n,  x2  —  n,  ■  ■  • ,  xn  —  n)  . 

2.  Compute  the  d  x  d  scatter  matrix  S  =  X7  X. 

3.  Because  S  is  positive  semi  definite,  one  can  compute  the  eigenvectors  of  S: 

S  =  UT\U . 

4.  Let  V  =  •  •  •  iud)i  a  N  x  d  matrix,  where  U{  is  i-th  eigenvector  of  S 

associated  with  the  largest  eigenvalues. 

5.  The  1  x  d  rows  of  V  are  the  new  feature  vectors  with  reduced  di¬ 

mensions. 

2.1.2  Graph  functions 

Graph  functions  and  operators  applied  on  them  are  important  components  when 
studying  graph  models  for  clustering  problems.  In  order  to  represent  a  partition 
on  the  graph,  consider  a  set  of  graph  functions  /  =  (/i,  f2, . . . ,  :  G  — »  Mn  : 


The  simplex  constrained  vector  value  taken  by  /  G  IB  can  indicate  class  assign¬ 
ment,  i.e.  if  fr{ui )  =  1  for  some  r,  then  n%  belongs  to  the  r-th  class.  Thus  for  each 
/  e  B,  it  corresponds  to  a  partition  of  the  graph  G  with  at  most  n  classes;  and 
every  partition  can  be  represented  by  a  unique  function  in  IB  accordingly.  One  can 
think  of  B  as  a  a  set  of  generalized  binary  indicator  functions.  Indeed,  the  r-th 
entry  fr  of  a  function  /  G  IB  is  a  1-0  binary  indicator  function  of  the  r-th  class. 
Thus,  a  set  IB  covers  all  the  relvant  functions  concerning  a  clustering  problem. 

However,  the  strict  constraints  on  IB  makes  it  difficult  to  apply  useful  operators. 
A  more  general  admissible  set  for  graph  functions  /  =  (/1;  /2, . . . ,  fh)  :  G  — >  Mn 
is  defined  as: 

K  :=  j  /|  /  :  G  ->  [0, 1]*,  J2  /rW  =  1 

l  r= 1 

which  is  a  compact  and  convex  set.  Note  that  the  set  IB  is  a  subset  of  IK.  The 
graph  functions  in  K  take  values  on  the  boundary  of  a  high  dimensional  simplex, 
while  those  in  IB  take  values  on  the  vertex  of  a  simplex. 


2.1.3  Graph  Laplacian 


One  of  the  most  commonly  used  operator  on  graph  functions  is  the  so-called  (un¬ 
normalized)  graph  Laplacian  matrix  L  :=  D— W  [22],  where  D  is  a  diagonal  matrix 
with  the  i-th  entry  being  the  node  strength  (degree)  d*  =  J2  f=iwij-  A  variant  of 

_i  _i 

the  Laplacian  is  the  symmetric  normalized  Laplacian  L,sym  :=  I  —  D  2WD  2 . 
where  I  is  the  identity  matrix  and  assuming  every  node  has  nonzero  degree  (i.e. 
di  >  0).  The  Laplacian  operators  have  many  useful  properties  which  can  give 
insights  to  the  graph’s  structure,  as  studied  in  depth  in  spectral  graph  theory  [22] . 
Here  we  introduce  several  commonly  used  observations  of  the  Laplacian  operators. 


For  a  real  valued  graph  function  a  :  G  — >  M,  observe  that 


N 


(a,  La)  =  -  ^2  m,j  in  i  n  ,  :  -  nin  .  ia  , 


(2.1) 


i,j= 1 
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where  (•,  •)  is  the  inner  product.  For  vector-valued  /  =  (/i,  /2, . . . ,  fa)  :  G  — >  Mn, 
let  (/,  L/)  =  Y^r=i  (fr,  L fr).  One  can  see  from  (2.1)  that  L  is  positive  semidehnite, 

_i  _i 

so  is  the  operator  Lsym  =  D  2  LD  2 ; 


The  right  side  of  (2.1)  is  analogous  to  f  \Va\2dw  in  Euclidean  space,  with  Wij 
corresponding  to  the  measure  dw.  Note  that  the  first  order  derivative  of  /  |Va|2d?n 
yields  the  negative  Laplacian  operator  —A  in  heat  equations,  according  to  the 
calculous  of  variation. 

Similar  to  the  Laplacian  operator  A  in  partial  differential  equations  (PDE), 
the  eigen-space  of  the  graph  Laplacians  L  and  Lsym  can  reflect  a  lot  of  information 
about  the  graph.  People  have  been  using  the  leading  eigenvectors  of  the  graph 
Laplacians,  (which  correspond  to  the  smallest  eigenvalues),  to  partition  the  graph 
into  meaningful  subgroups  [95]. 

Because  the  Laplacians  are  positive  semidehnite,  the  smallest  possible  eigen¬ 
value  is  zero.  Consider  a  graph  with  exactly  k  connected  components: 

G  =  GiUG2U...UGfc, 

where  any  pair  of  nodes  that  belong  to  two  different  components  have  zero  edge 
weight  between  them.  Observe  that  the  indicator  function  xgs  of  each  connected 
component  Gs  is  an  eigenvector  of  L  associated  with  eigenvalue  zero.  In  fact,  for 
Laplacian  L,  the  eigenvalue  zero  has  the  exact  algebraic  multiplicity  of  k.  In  other 
words,  L’s  eigen-space  of  eigenvalue  zero  has  k  dimensions,  with  the  k  indicator 
functions  x,Gi ,  Xg2i  •  •  • ,  XGk  being  a  set  of  orthogonal  basis.  On  the  one  hand,  it  is 
trivial  to  see  that  the  zero  eigenvalue  has  a  multiplicity  of  at  least  k;  on  the  other 
hand,  Proposition  1  shows  that  the  eigen-space  of  eigenvalue  zero  is  spanned  by 
the  k  indicator  functions  XGi,  Xg2,  ■  ■  ■  >  XGk,  and  thus  of  dimension  k. 
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Proposition  1.  Every  eigenvector  of  the  graph  Laplacian  L  associated  with  eigen¬ 
value  zero  must  have  constant  value  over  each  connected  component  of  the  graph. 

Proof.  Suppose  a  graph  G  has  k  connected  components  G  —  G\  U  G2  U  . . .  U  Gj~- 
Given  an  eigenvector  a  that  satisfies  La  =  0,  we  want  to  show  that  a  is  constant 
over  Gs,  for  any  s  G  {1,  2, . . . ,  k}. 

For  fixed  s  G  {1,  2, . . . ,  k},  without  loss  of  generality,  one  can  assume 

n\  =  argrriaxn .  eGs  {  a  (nl ) }  . 

Because  La  =  0  and  Wy  =  0  for  nj  qL  Gs,  we  have 

N 

dia(ni)  =  E  a(nj)wij  =  ^2  a(n j) tv \j 

j= 1  n,j&Gs 

N 

<  E  afn^wij  =  a(n  1)  E  wij  =  d\a{ni ) .  (2.3) 

n j&Gs  j=l 

The  equality  holds  if  and  only  if  w\j  =  0  or  a(nj)  =  a(ni).  It  implies  that 
every  node  rq-  in  Gs  that  has  non-zero  edge  connection  to  ri\  must  satisfy  a(rij)  = 
a(ni).  Hence  a  takes  the  constant  value  a(ni)  on  Gs,  because  Gs  is  a  connected 
component.  Thus  a  is  constant  on  every  connected  component  of  the  graph  G. 

□ 

Therefore,  by  looking  at  the  orthogonal  eigenvectors  associated  with  eigenvalue 
zero  one  can  obtain  the  number  of  connected  components  in  the  graph.  However, 
for  real  world  clustering  problems,  people  are  more  concerned  with  less  ideal 
situations,  where  the  clusters  of  the  graph  are  not  necessarily  fully  disconnected 
from  each  other.  Thus,  the  goal  is  to  identify  the  “loosely”  connected  components 
in  the  graph. 

In  practice,  people  find  the  leading  eigenvectors  associated  with  small  eigen¬ 
values  (close  to  zero)  very  useful  in  determining  such  components  [7,22,24,68,81]. 
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One  way  of  interpretation  is  that  these  leading  eigenvectors  approximate  the  zero- 
eigen-space  in  the  case  discussed  above  where  clusters  are  fully  disconnected,  and 
thus  approximate  the  indicator  functions  of  the  clusters.  Another  way  of  un¬ 
derstanding  the  use  of  leading  eigenvectors  is  through  the  equation  (2.1).  For  a 
connected  graph,  the  first  eigenvector  0i  of  the  graph  Laplacian  L  is  the  constant 
vector  1  (up  to  a  constant),  with  value  one  at  each  entry.  The  second  eigenvector 
02  satisfies: 

02  =  argmina±1  =  argmin|H|=l  a±1  ]-  ^  Wij  (a(n;)  -  a(ni))2  .  (2.4) 

\a,  a;  z iJ=1 

To  minimize  the  righthand  side  of  equation  (2.4),  02  takes  similar  value  on  node 
nt  and  n3  when  has  a  large  value.  Therefore  intuitively  speaking,  the  values  of 
02  are  grouping  together  nodes  with  strong  connections,  and  separating  the  ones 
with  loose  connections.  The  relation  between  the  graph  Laplacian  and  a  quality 
function  called  graph  cuts  is  mentioned  in  Section  2.2. 

In  many  cases  the  normalized  Laplacian  Lsym  is  preferred  because  it  is  more 
regularized,  (see  normalized  cuts  in  Section  2.2).  For  non-zero  eigenvalues,  the 
corresponding  eigenvectors  of  L  denoted  by  {0S}  can  not  be  directly  derived  from 
the  eigenvectors  of  Lsym  denoted  by  {0s},  nor  the  other  way  around.  However, 
for  the  zero  eigenvalue,  one  can  derive: 

„  „  _i  _i  „ 

Ls?;m0s=05-D  2WD  2  0S  =  0 

=>■  D  20s  —  D_1WD~20s  =  0 

=>  L(D-4s)  =  D(D-4,)-W(D-4,)  =  0 

=>  0s  =  D~4s-  (2.5) 

Therefore,  L’s  eigen-space  of  eigenvalue  zero  can  be  derived  directly  from  that  of 
L Sym,  and  vice  versa.  Results  similar  to  Proposition  1  can  be  shown  for  Lsym. 
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2.2  Quality  Functions 


Various  approaches  have  been  studied  to  perform  clustering  on  datasets  repre¬ 
sented  by  graphs.  A  more  detailed  review  can  be  found  in  [33]  and  its  references. 
The  main  work  of  this  thesis  (Chapter  3-5)  focuses  on  the  type  of  methods  which 
is  based  on  a  global  quality  function. 

To  be  more  specific,  in  order  to  cluster  or  partition  a  graph,  one  needs  a  quality 
function  to  quantify  and  measure  how  “good”  a  clustering  is.  The  quality  function 
models  how  clusters  are  defined  and  what  properties  of  a  clustering  is  favored, 
which  varies  depending  on  different  applications  at  hands.  Once  a  quality  function 
is  chosen,  the  remaining  task  is  to  optimize  it  over  all  admissive  clusterings,  either 
maximization  or  minimization  depending  on  the  specific  quality  function. 

Normally  a  quality  function  favors  clusters  that  have  tight  connections  within, 
but  are  loosely  connected  with  each  other.  How  to  balance  these  criteria  is  the 
core  of  a  quality  function.  This  section  introduces  several  representative  quality 
functions:  the  family  of  graph  cuts  and  its  connection  to  the  graph  total  variation , 
and  the  modularity  function.  Note  that  there  are  many  other  quality  functions  in 
the  literature  other  than  what  is  covered  in  this  section,  but  the  graph  cuts  and 
the  modularity  function  can  give  a  good  picture  of  this  type  of  methods. 

2.2.1  Graph  Cuts 

The  most  common  concept  in  graph  based  models  is  perhaps  the  graph  cuts. 
Consider  a  partitioning  of  the  graph  G  =  Ai  U  A2  U  . . .  An  where  {Ar}”=1  are 
disjoint  subsets  of  G.  Let  g  :  G  — >  {1,  2, . . . ,  n}  be  the  group  assignment  of  each 
node,  i.e.  a  node  nl  belongs  to  Ar  if  and  only  if  g*  :=  g(rq )  takes  the  value  r.  The 
graph  cuts  of  this  partitioning  is  defined  as  the  sum  of  all  the  edge  links  one  needs 
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to  cut  in  order  to  separate  {Ar}"=1  from  each  other: 


Cut(Ai,  A2, An)  :=  ^2  wij  •  (2.6) 

gi^gj 

The  graph  cuts  in  the  case  of  n  =  2  is  illustrated  in  Figure  2.2  (b),  where  a  red 
dash  line  is  splitting  the  graph  into  two  groups  of  nodes.  The  corresponding  graph 
cuts  for  this  splitting  is  the  sum  of  the  two  edges  being  cut  through  by  the  dash 
line. 


From  the  definition  one  can  see  that  the  graph  cuts  is  measuring  the  inter¬ 
connections  between  clusters.  Therefore,  if  one  wants  to  partition  a  graph  into 
loosely  connected  subgroups,  the  corresponding  graph  cuts  is  preferred  to  be  small. 
A  general  principle  of  clustering  can  thus  be  finding  a  clustering  that  minimizes 
the  graph  cuts  over  all  admissive  ones. 

However,  if  the  quality  function  is  set  to  be  the  graph  cuts  only,  it  tends  to 
yield  solutions  where  most  clusters  contains  only  one  single  node  while  one  cluster 
contains  most  of  the  nodes  in  the  graph.  Such  solutions  are  trivial  and  can  not 
provide  useful  information.  To  resolve  this  issue,  people  therefore  add  certain 
balance  terms  into  the  graph  cuts  so  that  the  sizes  of  clusters  are  favored  to  be 
similar. 


One  popular  version  of  modified  graph  cuts  is  the  normalized  cuts  [81],  usually 
in  the  two-cluster  form  (i.e.  n  —  2,  G  —  A\  U  A2): 


Ncut(A!,  A2) 


Cut(Ai,A2)  Cut(Ai,A2) 

vo^Ax)  vol(A2)  ’ 


(2.7) 


where  the  quantity  vol(A)  denotes  the  volume  of  a  subset  A  e  G  defined  as: 


vol(A)  :=  ^  di  ■ 

rii^A 

In  the  normalized  cuts,  the  volume  terms  in  the  denominators  serve  the  purpose 
of  balancing  the  cluster  sizes.  Because  it  favors  equal  volumes  on  A\  and  A2, 
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trivial  solutions  are  likely  to  be  avoided.  Therefore,  using  the  normalized  cuts  as 
the  quality  function  is  more  well-posed  than  the  graph  cuts  itself. 

Finding  the  minimizer  of  the  normalized  cuts  is  NP  hard  [96].  Therefore 
relaxing  the  problem  and  solving  it  approximately  is  a  common  option.  Among 
many  approaches  explored  in  the  literature,  spectral  clustering  [95]  is  a  popular  one 
to  solve  the  relaxed  version  of  the  normalized  cuts.  Analysis  in  [81]  shows  that  the 
eigenvectors  of  the  normalized  Laplacian  Lsym  is  an  approximate  solution  to  the 
minimization  of  Ncut.  More  specifically,  to  bipartition  the  graph  the  normalized 
spectral  clustering  first  computes  the  second  eigenvector 

02  =  argmin  ±1  (a’*Jav™a).  ,  (2.8) 

fa,  a) 

and  then  cluster  the  nodes  into  two  groups  according  to  the  signs  of  02’s  value. 
This  method  has  been  widely  used  due  to  its  efficiency  and  relatively  good  per¬ 
formance. 

There  are  other  variants  of  the  graph  cuts  such  as  the  ratio  cut  [41]  and  the 
Cheeger  cut  [15];  and  alternative  relaxations  are  explored  [15,16]. 


(a)  Indicator  function  (b)  Graph  cuts 


Figure  2.2:  (a).  An  indicator  function  /  of  a  set  A,  illustrating  that  the  length  of 
A!s  perimeter  is  the  total  variation  of  f.  (b).  Illustration  of  the  graph  cuts. 
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2.2.2  Total  Variation 


In  Euclidean  space,  the  total  variation  (TV)  norm  of  a  real  valued  function  /  is 
defined  as  the  L\  norm  of  /’ s  gradient: 

\f\TV.=  f  |v/|.  (2.9) 

The  TV  norm  has  been  used  as  a  very  useful  regularizer  in  image  processing  and 
L\  compressive  sensing,  such  as  in  the  Rudin-Osher-Fatemi  denoising  model  [80]. 
An  intuitive  interpretation  of  TV  is  illustrated  in  Figure  2.2  (a):  given  a  subset 
A  in  the  problem  domain  and  an  indicator  function  /  =  xa,  the  length  of  region 
A’s  perimeter  equals  the  TV  norm  of  /: 

I /|  tv  =  length  (T^) , 

where  T^  denotes  the  boundary  of  the  subset  A.  Therefore  by  adding  the  total 
variation  term  into  the  energy  (quality  function)  of  a  minimization  problem,  one 
can  enforce  a  more  regular  boundary  in  a  segmentation,  or  a  smoother  intensity 
variation  in  an  image. 

Analogously,  people  define  a  graph  total  variation  for  /  :  G  — >  M 

1  N 

I /I  TV  :=  2  Wi'j  l/K)  -  /KOI  ■  (2-10) 

i,j= 1 

Interestingly,  the  total  variation  on  a  graph  function  has  very  close  connection  to 
the  graph  cuts  introduced  in  the  last  section.  In  fact,  we  have 

Cut  (A,  Ac )  =  \xa\tv  ■ 


In  another  words,  the  graph  cuts  of  a  partition  can  be  seen  as  the  boundaries  of 
the  subsets. 

Comparing  the  equations  (2.1)  and  (2.10),  one  can  observe  that  when  /  =  xa, 
the  equality  holds: 

(/•  L/)  =  |/|tv  •  (2-11) 
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Therefore,  (/,  L /)  is  an  L2  relaxation  of  |/|Ty  and  hence  the  graph  cuts  Cut(A,  Ac). 
This  is  an  intuitive  interpretation  of  why  the  spectral  clustering  method  is  an  ap¬ 
proximate  solution  to  the  family  of  graph  cuts  problems. 

2.2.3  Modularity 

From  the  perspective  of  network  science,  the  clustering  problem  is  posed  as  “com¬ 
munity  detection1'  [76].  Communities  are  functional  units  of  a  network  which  have 
close  links  within  but  loose  connections  with  the  outside.  Usually  the  number  of 
clusters  (communities)  is  not  pre-assigned  for  community  detection. 

A  popular  method  for  community  detection  is  the  modularity  optimization 
[69,71-73].  The  quality  function  modularity  is  to  be  maximized  to  achieve  a  good 
clustering  of  a  network.  The  modularity  of  a  partition  g  is  defined  as 

Q(9)  =  ^  [  w^  ~  )  S(9»9j) ,  (2-12) 

where  gt  G  {1,2 , ,  h}  is  the  community  assignment  of  nl ,  and  7  is  a  resolution 
parameter  [78].  The  term  S(gi,gj )  =  1  if  gt  =  gj  and  5(gi,gj )  =  0  otherwise.  The 
resolution  parameter  can  change  the  scale  at  which  a  network  is  clustered  [33,76]. 
A  network  breaks  into  more  communities  as  one  increases  7.  The  resolution 
parameter  implicitly  controls  the  number  of  clusters.  In  modularity  optimization, 
there  is  no  fixed  number  of  clusters  being  enforced  directly  in  the  quality  function. 

The  modularity  of  a  graph  partition  measures  the  fraction  of  total  edge  weight 
within  each  community  minus  the  edge  weight  that  would  be  expected  if  edges 
were  placed  randomly  using  some  null  model  [76].  The  most  common  null  model 
is  the  Newman-Girvan  (NG)  model  [73],  which  assigns  the  expected  edge  weight 
between  rq  and  77  to  be  recalling  that  d*  =  Yl^=i  wis  is  the  strength  (i.e., 
weighted  degree)  of  a  node  rq  and  the  quantity  2m  =  Y^iLi  di  is  ^ie  total  volume 
(i.e.,  total  edge  weight)  of  the  graph  ( G,E ).  An  advantage  of  the  NG  null  model 
is  that  it  preserves  the  expected  strength  distribution  of  the  network.  As  shown 
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in  the  definition  (2.12),  we  adopt  the  term  r^L  as  the  null  model  in  this  work. 
The  literature  on  algorithms  of  modularity  optimization  is  briefly  introduced  in 
Section  4.1. 

2.2.4  Number  of  clusters 

The  number  of  clusters  is  an  important  factor  of  clustering  algorithms.  In  some 
methods  such  as  k-means,  there  is  a  pre-assigned  number  of  clusters  which  serves 
as  an  explicit  constraint  in  the  quality  function.  For  some  other  methods  such 
as  the  modularity,  the  number  of  clusters  is  not  specified,  but  rather  induced 
implicitly  by  other  parameters  in  the  quality  function.  Sometimes  a  method  can 
only  do  a  bipartitioning  at  a  time  (i.e.  partitioning  into  two  subgroups),  and 
people  therefore  implement  it  recursively  to  achieve  a  desirable  number  of  clusters. 

These  methods  suit  different  purposes  depending  on  the  applications.  For  tasks 
such  as  partitioning  computer  work  loads  for  parallel  computing,  the  number  of 
clusters  is  fixed.  However,  for  community  detection  in  social  networks,  usually 
there  is  no  “true1'  number  of  clusters  in  the  clustering  problem.  Because  it  all 
depends  on  at  what  scale  one  examines  the  network:  coarse  or  fine,  high  level 
structure  or  low  level  detail.  In  network  science,  it  makes  more  sense  to  look  at  a 
range  of  scales  in  order  to  understand  a  fuller  picture  of  the  data,  rather  than  a 
single  fixed  scale  or  a  fixed  number  of  clusters. 
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CHAPTER  3 


Network  Analysis  using  Multi-slice  Modularity 

Optimization 


In  this  chapter,  a  generalized  modularity  function  is  discussed  and  implemented 
on  both  social  network  and  imagery  datasets.  Diagnostic  analysis  of  the  data  is 
performed  accordingly  to  understand  the  network  structure. 

3.1  Multi-slice  Modularity 

As  introduced  in  the  previous  chapter,  the  modularity  Q  defined  in  the  equa¬ 
tion  (2.12)  is  a  quality  function  that  measures  the  “goodness”  of  a  clustering. 
Finding  a  network  partition  that  attempts  to  maximize  Q  allows  one  to  probe  a 
network’s  community  structure.  In  contrast  to  traditional  forms  of  spectral  clus¬ 
tering,  modularity  optimization  requires  no  knowledge  of  the  number  or  sizes  of 
communities,  and  it  also  allows  one  to  segment  a  network  into  communities  of 
disparate  sizes  [69,76]. 

Optimization  of  modularity  was  recently  generalized  to  a  “multi-slice”  network 
framework  [64].  It  consists  of  layers  of  ordinary  networks,  which  share  the  same 
node-set  but  may  have  different  edge-set.  In  another  words,  for  the  s-th  slice 
there  is  a  similarity  matrix  Ws  representing  the  intra-slice  connections  between 
nodes  {nls}^Ll .  Additionally  there  are  also  inter-slice  connections  linking  the 
corresponding  nodes  across  slices,  such  as  between  nodes  nir  and  riiS  from  the  r-th 
and  the  s-th  slice  respectively.  This  framework  of  the  multi-slice  networks  can 
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thereby  be  used  to  represent  time-dependent  or  multi-scale  networks. 


Under  this  framework,  the  generalized  modularity  function  proposed  in  [64]  is 
defined  as: 


(3.1) 


where  (jjr  indicates  that  community  assignment  of  node  j  from  slice  r,  the  intra¬ 
slice  edge  strength  of  node  j  in  the  s-th  slice  is  djS  =  Yhiwijsi  the  corresponding 
inter-slice  edge  strength  is  c]S  =  ^rCjsr)  and  2 /i  =  J2jrdjr  +  Cjr.  In  (3.1),  one 
can  use  a  different  resolution  parameter  in  each  slice.  For  a  given  slice  s,  the 
quantity  WijS  gives  the  edge  weight  between  nodes  nis  and  rijS.  For  a  given  node 
j,  the  quantity  Cjsr  gives  the  inter-slice  coupling  between  the  rth  and  sth  slices. 

In  Fig.  3.1,  we  show  a  schematic  of  the  multi-slice  network. 


Figure  3.1:  Schematic  of  a  multislice  network. 


Optimization  of  the  ordinary  modularity  function  (2.12)  has  been  used  to 
study  community  structure  in  myriad  networks  [76],  and  it  has  also  been  used 


in  the  analysis  of  hyperspectral  images  [58]  recently.  In  this  work,  we  optimize 
multislice  modularity  (3.1)  to  examine  community  structure  in  social  networks  and 
segmentation  of  images.  In  each  case,  we  start  with  a  static  graph,  and  each  layer 
of  the  multi-slice  network  uses  the  same  adjacency  matrix  but  associates  it  with 


23 


a  different  resolution-parameter  value  rys.  We  include  inter-slice  edges  between 
each  node  j  in  adjacent  slices  only,  so  Cjsr  =  0  unless  |r  —  s\  =  1.  We  set  all 
nonzero  inter-slice  edges  to  a  constant  value  to.  This  setup,  which  was  illustrated 
using  the  infamous  Zachary  Karate  Club  network  in  [64],  allows  one  to  detect 
communities  using  a  range  of  resolution  parameter  values  while  enforcing  some 
consistency  in  clustering  identical  nodes  similarly  across  slices.  The  strength  of 
this  enforcement  becomes  larger  as  one  increases  to.  To  optimize  the  multi-slice 
modularity  (3.1),  we  use  a  Louvain-like  locally-greedy  algorithm  [9,50],  which  is 
explained  in  Section  4.1.  In  our  tests,  we  use  the  GenLouvain  code  (in  Matlab) 
from  Ref.  [50].  The  GenLouvain  code  is  a  modified  implementation  of  the  Louvain 
locally  greedy  algorithm  [9]  so  that  it  applies  to  the  cases  with  arbitrary  values  of 
resolution  parameter,  but  it  was  not  optimized  for  speed. 

3.2  LAPD  Field  Interview  Data 

In  [91],  we  use  data  with  both  geographic  and  social  information  about  stops 
involving  street  gang  members  in  the  Los  Angeles  Police  Department  (LAPD) 
Division  of  Hollenbeck  [92],  We  optimize  multi-slice  modularity  (3.1)  as  a  means 
of  unsupervised  clustering  of  individual  gang  members  without  prior  knowledge 
of  the  number  of  gangs  or  affiliation  of  the  members.  We  subsequently  examine 
network  diagnostics  over  slices  to  attempt  to  estimate  the  number  of  gangs  that 
is  stable  across  multiple  resolution-parameter  values  and  that  also  corresponds 
roughly  to  the  number  expected  by  the  LAPD. 

3.2.1  Data  description 

The  numerical  experiment  is  tested  on  the  field  interview  data  from  Hollenbeck 
provided  by  LAPD.  It  records  the  incidents  of  gang  members  being  stopped  while 
meeting  with  each  other.  In  total  there  are  748  individuals  involved,  who  belong 


24 


Q  O 


<*> 

O  O 


Figure  3.2:  Hollenbeck  gang  member  distribution. 

to  31  different  gangs.  As  shown  in  Figure  3.2,  the  748  colored  dots  represent 
all  the  interviewed  members  distributed  in  Hollenbeck  according  to  their  average 
location  across  all  the  recorded  events,  with  the  color  indicating  their  affiliation. 

Consider  a  graph  with  each  node  representing  a  member  interviewed.  We  have 
a  social-connection  adjacency  matrix  S  of  size  748  x  748  satisfying  S[i,  j]  =  1  when 
the  members  associated  with  nodes  n,  and  rij  have  been  recorded  meeting  with 
each  other  at  some  point,  and  S [i,j]  =  0  otherwise,  (S [i,  z]  =  1).  Additionally 
there  is  a  geographical  matrix  G  of  the  same  size  satisfying: 

G [i,j]  =  exp(-dist2(i,j)/cr2) , 

which  is  derived  from  the  location  information  of  each  individual.  The  quantity 
dist(i,j )  denotes  the  physical  distance  between  the  individual  nd s  average  loca¬ 
tion  across  all  the  recorded  events  and  that  of  rij .  The  term  cr  is  a  normalizing 
parameter  which  is  hand  picked  empirically.  The  gang-affiliation  information,  i.e. 
the  ground  truth  (GT),  of  each  gang  member  is  also  known.  Each  individual 
belongs  to  a  unique  gang.  The  goal  is  to  find  the  community  structures  of  these 
748  people  based  on  the  matrixes  S  and  G,  without  any  prior  knowledge  about 
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the  their  affiliation  information  or  the  number  of  gangs. 


3.2.2  Geographical  and  social  matrix 

Define  a  similarity  (weight)  matrix  W  which  linearly  combines  both  the  social 
connection  S  and  the  geographical  information  G: 

W  :=  a;S  +  (1  —  a)G  . 

To  implement  the  multi-slice  modularity  framework,  let  the  subgraph  Ws  for  the 
s-th  slice  to  be  the  same  as  W  but  associated  with  a  different  value  of  resolution 
parameter  =  0.4  + 0.1s,  s=l,2,...,40.  In  other  words,  the  multi-slice  network  we 
are  considering  consists  of  40  slices,  and  each  one  of  them  is  a  copy  of  W  indexed 
by  a  different  %.  The  slices  are  ordered  according  to  the  values  of  %.  We  connect 
each  node  in  every  slice  to  the  corresponding  nodes  in  neighboring  slices,  i.e.  we 
set  Cjsr  G  {0,ca}  and  it  equals  to  u  if  and  only  if  s  and  r  are  consecutive  slice 
index  (|s  —  r|  =  1).  This  way  one  can  detect  communities  simultaneously  over  a 
range  of  resolution  parameters,  while  still  enforcing  some  consistency  in  clustering 
identical  nodes  similarly  across  slices. 

For  a  specific  fixed  a,  the  resulting  partition  of  the  multi-slice  network  using 
a  Louvain-like  algorithm  [9,50]  straightforwardly  yields  partitions  on  each  slice. 
Each  such  partition  is  a  clustering  on  the  group  of  748  interviewed  members. 
Network  diagnostics,  such  as  the  number  of  clusters,  purity,  and  zRand-score ,  are 
examined  on  these  clusterings  to  understand  the  network  structure  by  comparing 
to  the  ground  truth  (GT)  gang  affiliation. 

As  mentioned  before,  the  modularity  optimization  determines  the  number  of 
clusters  as  part  of  its  output.  The  quantities  purity  and  zRand-score  are  used 
to  measure  how  “similar”  one  clustering  is  compared  to  the  ground  truth.  To 
compute  purity  [42],  we  assign  to  all  nodes  in  a  given  community  the  label  of  the 
gang  that  appears  the  most  often  in  that  group  (in  case  of  a  tie  between  two  or 
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(a)  a=0  (b)  a=0.4 


(c)  a=0.8 

Figure  3.3:  Partitioning  results  using  the  multi-slice  modularity  optimization, 
with  uj  —  1  and  W  =  aS  +  (1  —  a)G.  The  number  of  clusters  (Nc),  zRand-score 
(zR)  and  purity  (Prt)  for  the  partition  of  each  slice  are  plotted  as  functions  of  the 
resolution  parameter  ys.  Different  values  of  a  is  used:  (a)  a  =  0,  (b)  a  =  0.4,  (c) 
a  =  0.8. 

more  gangs,  the  label  of  one  of  these  gangs  is  arbitrarily  chosen  for  all  the  nodes 
in  that  group).  The  purity  is  then  given  by  the  fraction  of  the  correctly  labeled 
individuals  in  the  whole  data  set.  The  value  of  purity  ranges  from  zero  to  one, 
with  one  indicating  a  good  match. 

The  quantity  zRand-score  is  a  standardized  pair  counting  method  measuring 
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the  similarity  of  two  partitions.  To  obtain  the  zRand-score,  one  computes  the 
number  of  pairs  w  of  individuals  who  belong  to  the  same  gang  and  who  are  placed 
in  the  same  community  by  a  clustering  algorithm.  One  can  then  compare  this 
number  to  its  expected  value  under  a  hypergeometric  distribution  with  the  same 
number  and  sizes  of  communities.  The  zRand-score,  which  is  normalized  by  the 
standard  deviation  from  the  mean,  indicates  how  far  the  actual  w  value  lies  in  the 
tail  of  a  distribution  [87].  There  is  no  upper  bound  for  the  value  of  zRand-score, 
and  higher  value  indicates  a  good  clustering. 

In  Figure  3.3,  the  number  of  clusters  (Nc),  zRand-score  (zR)  and  purity  (Prt) 
for  the  partition  of  each  slice  are  plotted  as  functions  of  the  resolution  parameter 
7S .  The  inter-slice  coupling  parameter  is  set  to  be  a  constant  u  =  1.  Small 
variations  in  the  value  of  u  did  not  qualitatively  change  the  result. 

A  plateau  in  the  number  of  clusters  curve  may  indicate  a  relatively  stable 
status  of  the  clustering  and  hence  yields  a  reasonable  number  of  gangs.  Therefore, 
from  Figure  3.3  we  seek  plateaus  in  the  number  of  clusters  that  are  near  a  local 
maximum  of  the  zRand-score.  The  details  differ  slightly  for  different  a,  but  the 
general  picture  that  arises  is  that  the  optimal  number  of  clusters  for  our  data  lies 
around  18  clusters  with  a  resulting  z-Rand  score  of  about  180.  Purity  is  again 
roughly  constant  and  again  near  0.5.  Note,  however,  that  comparing  the  purity 
scores  of  two  different  partitions  with  different  numbers  of  clusters  is  not  very 
meaningful,  as  purity  is  biased  to  favor  partitions  with  more  clusters. 

3.2.3  Ground  truth  geographical  matrix 

From  the  previous  experiment  shown  in  Figure  3.3,  we  observe  that  changing 
the  value  of  a  G  [0, 1]  does  not  have  a  big  influence  on  the  resulting  purity  and 
zRand-score.  This  suggests  that  the  social  component  of  the  data  is  too  sparse  to 
significantly  improve  the  clustering. 


To  test  whether  an  improvement  might  be  expected  at  all  if  more  information 
on  social  interactions  becomes  available  in  the  future,  we  replace  the  social  matrix 
S  by  the  ground  truth  matrix  T  in  formulating  the  similarity  matrix: 

Wgt  =  oT£  +  (1  —  a)G  . 

The  ground  true  matrix  T  is  derived  from  the  known  affiliation  of  each  interviewed 
member:  T [i,j]  =  1  if  and  only  if  nl  and  rij  belong  to  the  same  gang  or  i  —  j,  and 
T [i,j\  =  0  otherwise.  Additionally,  the  nonzero  entry  of  T  is  randomly  switched 
to  zero  with  probability  0.1.  Therefore  we  are  actually  using  90%  of  the  ground 
truth  in  the  probability  sense. 

The  multi-slice  network  is  subsequently  constructed  similar  to  the  previous 
section.  It  consists  of  200  slices,  and  each  of  them  is  a  copy  of  Wgt  associated 
with  a  different  value  of  the  resolution  parameter  s  =  1,  2, . . . ,  200.  The  slices 
are  ordered  according  to  the  value  of  %.  In  the  tests  shown  in  Figure  3.4  (a), 
(b)  and  (d),  the  resolution  parameter  takes  value  %  =  0.5  +  0.02s,  while  in  (c) 
%  =  2  +  0.02s. 

In  Figure  3.4,  the  number  of  clusters  (Nc),  zRand-score  (zR)  and  purity  (Prt) 
are  plotted  as  functions  of  the  resolution  parameter  for  partitions  on  each  slice. 
One  can  observe  that  when  a  —  1,  which  means  only  T  (90%  GT)  contributes, 
the  Nc  perfectly  picked  up  the  number  31.  As  a  decreases,  more  geographical 
information  is  involved  instead  of  the  ground  truth,  and  there  are  increasing  un¬ 
stableness  in  the  plots,  similar  to  the  real  data  which  is  sparse  and  noisy.  For 
ol  =  0.8,  one  can  see  that  there  is  a  turning  point  in  the  Nc  curve  around  =  4.9 
which  corresponds  to  Nc=31.  The  number  of  clusters  increases  progressively  until 
7S  =  4.9,  where  it  starts  increasing  dramatically.  The  zRand-score  also  starts  to 
decline  right  after  that  point.  This  suggests  that  the  network  breaks  into  small 
pieces  which  no  longer  have  significant  structure  meaning  after  %  =  4.9.  Hence 
the  partition  corresponding  to  this  point  may  give  us  interesting  information. 
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(a)  a=l 


(b)  a=0.8 


(c)  a=0.8  (d)  a=0.4 

Figure  3.4:  Partitioning  results  using  the  multi-slice  modularity  optimization,  with 
uj  —  1  and  Wgt  =  ccT  +  (1  —  a) G.  The  number  of  clusters  (Nc),  zRand-score 
(zR)  and  purity  (Prt)  for  the  partition  of  each  slice  are  plotted  as  functions  of  the 
resolution  parameter  ys.  Different  values  of  a  is  used:  (a)  a  =  1,  (b)  a  =  0.8,  (c) 
a  =  0.8,  and  (d)  a  =  0.4. 

The  specific  partition  corresponding  to  a  =  0.8,  =  4.9  is  visualized  in  Figure 

3.5.  Each  pie  corresponds  to  a  gang,  and  the  location  of  each  pie  is  the  true 
location  of  that  gang  (on  average).  The  color  indicates  the  community  assignment 
of  the  partition.  Most  of  the  gangs  are  successfully  picked  up  by  the  partition. 
Therefore,  the  resolution  parameter  value  around  =  4.9  yields  a  good  clustering 
with  the  number  of  clusters  close  to  the  ground  truth.  This  experiment  shows  that 
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more  social  information  indeed  helps  to  improve  the  performance  of  the  method. 


Figure  3.5:  Visualization  of  the  partitioncorresponding  to  a  =  0.8,  %  =  4.9.  Each 
pie  represents  one  gang,  placed  according  to  actual  geographical  information.  The 
color  indicates  partition  assignments.  Lines  connect  pies  of  the  same  color. 


3.3  Cow  Image 

The  modularity  function  is  mostly  studied  in  the  network  science  for  community 
detection,  rarely  in  image  processing.  In  the  few  literatures  where  it  is  imple¬ 
mented  on  imaging,  such  as  in  [58],  only  a  fixed  scale  (7  =  1)  has  been  briefly 
studied.  In  this  section,  the  multi-slice  modularity  is  implemented  on  the  task 
of  unsupervised  image  segmentation  to  explore  the  components  of  an  image  at  a 
range  of  multiple  scales.  This  work  is  presented  in  the  paper  [47]. 

The  experiment  is  performed  on  a  cow  image  showed  in  Figure  3.6,  which 
contains  about  3  x  104  pixels.  The  goal  is  to  segment  the  image  without  specifying 
the  number  of  image  components.  A  graph  is  built  for  this  image  in  which  each 
node  corresponds  to  a  pixel  and  each  edge  indicates  the  similarity  between  a  pair 
of  pixels.  We  associate  a  3  x  3  pixel-neighbor  patch  with  each  pixel  2  in  the  image. 
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Figure  3.6:  Image  of  a  pair  of  cows,  which  we  downloaded  from  the  Microsoft 
Research  Cambridge  Object  Recognition  Image  Database  (copyright  ©  2005  Mi¬ 
crosoft  Corporation).  It  is  cropped  from  the  original  image  to  produce  the  seg¬ 
mentation  in  Figure  3.7. 

By  stacking  together  the  three  channels  (RGB)  of  the  pixel-neighbor  patch,  one 
obtains  a  27-dimensional  feature  vector  denoted  by  v,  for  each  pixel  i.  Let  po(i,  j) 


denote  the  L2  norm  of  the  difference  of  patches  corresponding  to  the  pixels  i  and 
j,  i.e.: 


Pd(lj)  :=  |k  -  Vj ||2 


The  similarity  matrix  W.,  that  is  used  in  each  layer  of  the  multi-slice  network 
is  set  to  be  the  same  with  elements: 


(3.2) 


where  r(i)  is  the  30th  smallest  pr>  between  pixel  i  and  other  pixels  [99],  known 
as  local  scaling.  By  using  the  pixel-neighbor  patch  in  building  the  graph,  more 
non-local  and  texture  information  is  covered. 

We  construct  a  multislice  network  that  consists  of  six  copies  of  A.  We  associate 
the  resolution  parameter  value  =  0.04s  —  0.03  with  slice  s  G  {1, ...  ,6}.  We 
then  optimize  multislice  modularity  and  obtain  the  image  segmentations  shown 
in  Fig.  3.7.  (Color  indicates  group  assignments.)  With  this  procedure,  we  are 
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(c)  (d) 

Figure  3.7:  (a)  Segmentation  of  the  cow  image  in  Fig.  3.6  obtained  using  op¬ 
timization  of  multislice  modularity  with  interslice  coupling  parameter  u  =  0.3. 
The  horizontal  axis  shows  the  slice  index  s  €  {1, . . . ,  6},  which  has  an  associated 
resolution-parameter  value  of  =  0.04s  —  0.03.  The  vertical  axis  gives  the  sorted 
pixel  index.  Color  in  each  vertical  stripe  indicates  the  community  assignments 
of  the  pixels  in  the  corresponding  network  slice.  We  also  show  the  segmentation 
that  we  obtain  in  the  images  for  (b)  =  0.05,  (c)  =  0.13,  and  (d)  =  0.21. 


33 


able  to  identify  all  four  components  of  the  image.  As  indicated  in  panel  (a),  we 
obtain  smaller-scale  communities  (i.e.,  groups  of  pixels)  as  we  increase  the  value 
of  the  resolution  parameter.  Importantly  (see  the  discussion  in  Section  3.1),  the 
coupling  between  slices  enforces  some  consistency  in  clustering  identical  nodes 
similarly  across  slices.  In  panels  (b)  and  (c),  we  observe  a  good  segmentation  of 
the  two  cows,  the  sky,  and  the  background  grass.  As  indicated  in  panel  (d),  the 
three  groups  corresponding  to  the  two  cows  and  the  sky  stay  relatively  stable, 
but  the  group  corresponding  to  the  grass  breaks  down  by  the  sixth  slice.  The 
computational  time  using  the  GenLouvain  is  about  3  hours.  This  inspired  us  to 
develop  more  efficient  and  scalable  algorithms  in  Chapter  4  and  5. 

When  building  a  graph  for  an  image  with  each  pixel  being  a  node,  the  size  of 
the  graph  can  easily  get  very  large.  This  poses  challenges  upon  both  computing 
the  similarity  matrix  and  storing  it  in  computer  memory.  To  deal  with  the  mem¬ 
ory  limits,  people  usually  consider  a  sparse  similarity  matrix  which  has  nonzero 
entries  at  the  order  of  O(N),  instead  of  a  full  one  which  is  0(N2).  To  make  a 
similarity  matrix  sparse,  one  common  choice  known  as  the  K-Nearest-Neighbor 
(KNN)  method  is  to  keep  K  edges  with  the  largest  weight  for  each  node  and  set 
the  others  to  zero,  where  K<<  N. 

To  compute  a  KNN  similarity  matrix  straightforwardly  is  still  a  computation¬ 
ally  intensive  task.  Because  it  needs  to  compute  0(N2)  many  pair-wise  similarity 
to  ford  the  K  largest  ones  for  each  node.  Thus  it  does  not  scale  well  when  N  be¬ 
comes  large.  For  example,  it  takes  60  hours  to  straightforwardly  build  the  KNN 
similarity  matrix  for  the  cow  image  test.  To  tackle  this  difficulty,  one  option  is 
to  randomly  sample  a  small  subset  of  all  the  pixels,  and  compute  the  K  largest 
similarity  among  them  instead  of  the  whole  set  of  pixels.  In  the  cow  image  test, 
the  time  cost  can  be  reduced  to  about  2  hours  by  such  sub-samplying  without  af¬ 
fecting  the  performance  significantly.  Empirically  if  the  randomly  sampled  subset 
of  pixels  can  reasonably  represent  the  whole  pixel  set  (in  terms  of  their  associ- 
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ated  neighbor  patch),  then  the  KNN  similarity  matrix  produced  this  way  would 
not  affect  the  performance  much.  A  more  sophisticated  heuristic  for  finding  the 
KNN  pixel-patch  by  random  sampling  is  proposed  in  [6],  which  achieved  good 
performance  in  industrial  application.  In  Chapter  5,  a  Nystrom  method  is  used 
to  improve  the  efficiency  of  constructing  large  graph. 

This  application  of  multi-slice  modularity  on  image  segmentation  is  computa¬ 
tionally  expensive  in  general  due  to  the  large  number  of  pixels.  It  takes  a  lot  of 
computational  memory  and  time  to  run  the  optimization  using  more  slices,  which 
we  would  like  to  do  in  order  to  investigate  how  the  segmentation  evolves  over  a 
larger  range  of  resolution  values.  Computational  improvements  will  be  necessary 
to  conduct  more  detailed  analysis. 
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CHAPTER  4 


A  Variational  Method  for  Modularity 
Optimization 


Modularity  optimization  is  a  widely  adopted  method  in  network  science  for  the 
task  of  community  detection.  By  maximizing  modularity,  one  expects  to  obtain 
a  reasonable  partitioning  of  a  network.  Recall  that  the  modularity  of  a  partition 
g  is  defined  as 

«»>  =  ^  E  “  7S)  S(9t'Si) '  (4'1) 

i,3= 1  V  7 

One  can  refer  to  Chapter  2  for  detailed  definitions  and  an  interpretation  of  the 
modularity  function. 

In  Chapter  3,  we  have  explored  several  interesting  applications  of  the  multi¬ 
slice  modularity  model.  In  general,  the  modularity  optimization  method  has 
achieved  a  lot  of  success  in  network  science.  However,  this  maximization  prob¬ 
lem  is  NP  hard  [12],  hence  considerable  effort  has  been  put  into  the  development 
of  computational  heuristics  to  obtain  network  partitions  with  high  values  of  Q. 
The  background  for  modularity  optimization  algorithms  is  briefly  introduced  in 
Section  4.1. 

However,  these  approaches  are  either  heuristic  and  not  mathematically  well 
founded  (such  as  the  Louvain-like  algorithm),  or  not  efficient  and  accurate  enough 
for  large  scale  datasets  (such  as  the  spectral  methods).  This  chapter  is  aiming 
at  developing  an  alternative  approach  to  solving  the  modularity  optimization  in¬ 
spired  by  variational  methods  from  PDE  and  image  processing,  such  that  the 
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proposed  approach  is  both  efficient  and  mathematically  well  founded.  In  Section 
4.2  certain  variational  methods  in  Euclidean  space  relating  to  clustering  problems 
are  introduced.  Starting  from  Section  4.3,  the  modularity  optimization  problem  is 
posed  as  an  energy  minimization  problem  in  the  graph  setting;  then  a  variational 
scheme  is  proposed  to  solve  for  the  minimization  problem  using  a  graph  version 
of  the  MBO  scheme.  Numerical  experiments  are  presented  after  that. 

4.1  Background  for  Modularity  Optimization  Algorithms 

Numerous  methods  have  been  proposed  [33,76]  to  approximately  solve  modularity 
optimization.  These  include  greedy  algorithms  [23,67],  extremal  optimization 
[11,28],  simulated  annealing  [40,51],  spectral  methods  (which  use  eigenvectors  of 
a  modularity  matrix)  [68,79],  and  more.  The  locally  greedy  Louvain- like  algorithm 
by  Blondel  et  al.  [9]  is  arguably  the  most  popular  computational  heuristic;  it  is  a 
very  fast  algorithm,  and  it  also  yields  high  modularity  values  [33,53]. 

Section  4.6.3  will  discuss  more  details  about  the  spectral  method,  where  the 
leading  eigenvectors  (associated  with  the  largest  eigenvalues)  of  a  so-called  modu¬ 
larity  matrix  B  =  W  —  P  are  used  to  approximate  the  modularity  function  Q.  In 
the  modularity  matrix,  P  is  the  probability  null  model  and  Pl3  =  is  the  NG 
null  model  with  7  =  1.  Ref.  [68]  gives  a  very  neat  mathematical  derivation  of  the 
spectral  method  for  bipartitioning,  while  the  authors  of  [79]  generalized  it  to  the 
tripartitioning  scenario. 

The  remaining  of  this  section  gives  a  brief  description  of  the  Louvain-like 
algorithm  [9].  Note  that  the  original  version  of  this  algorithm  only  applies  to  the 
particular  scenario  of  7  =  1.  Some  part  of  the  algorithm  efficiency  depends  on 
the  fact  7  =  1.  The  GenLonvain  code  from  [50]  generalizes  it  to  scenarios  with 
arbitrary  values  of  7. 

The  algorithm  alternates  between  the  following  two  phases: 
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1.  Assume  we  have  a  weighted  graph  with  N  nodes  at  the  current  stage.  Let 
each  node  to  be  a  distinctive  cluster  by  itself.  For  each  node  nl ,  merge 
it  to  its  neighbor  n/s  cluster  that  gives  the  maximum  positive  increase  in 
modularity.  If  none  of  rq’s  neighbor  nodes  can  give  a  positive  increase  in 
modularity  via  this  process,  then  leave  n*’s  group  assignment  unchanged. 
Perform  this  process  for  every  node  in  a  random  order  and  repeat  it  until 
no  more  change  can  be  made.  Note  that  the  order  of  nodes  in  this  process 
would  affect  the  outcome,  but  according  to  [9]  it  is  empirically  insignificant. 

2.  At  this  phase  a  new  network  is  constructed  where  each  new  node  corresponds 
to  a  cluster  from  the  previous  phase.  The  link  between  a  pair  of  new  nodes 
is  the  sum  of  all  the  edge  weights  between  the  two  associated  clusters.  Self- 
loop  may  occur  in  this  new  network.  After  this  phase,  the  total  number  of 
clusters  is  reduced. 

While  iterating  between  the  above  two  phases,  the  number  of  clusters  decreases 
and  the  modularity  score  increases.  The  algorithm  terminates  when  no  more 
change  can  be  made  and  the  modularity  achieves  a  (local)  maximum  value. 

4.2  Variational  Method 

The  optimization  problem  of  a  graph  quality  function  concerning  clustering  is 
usually  NP  hard.  In  order  to  develop  efficient  algorithms  with  reasonable  com¬ 
putational  costs  for  the  optimization,  people  have  explored  various  approaches  to 
approximately  solve  it.  To  further  study  the  clustering  problem,  we  explore  an 
alternative  approach  to  the  optimization  problem  which  is  mathematically  well 
founded  as  well  as  computationally  efficient.  In  this  section,  the  background  of 
certain  variational  methods  in  Euclidean  space  relating  to  clustering  problems  is 
introduced;  and  in  the  following  sections,  we  present  a  graph  version  variational 
method  for  modularity  optimization. 
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In  the  analysis  of  PDE  inn  Euclidean  space,  a  variational  method  refers  to 
using  calculus  of  variation  techniques  to  find  the  minimizer  (or  maximizer)  of  a 
functional  (energy)  over  certain  function  spaces.  Two  variational  methods  that 
are  closely  related  to  the  clustering  problem  studied  in  this  work  are  introduced: 
the  Ginzburg-Landau  functional  and  the  MBO  scheme. 

4.2.1  Ginzburg-Landau  functional 

The  Ginzburg-Landau  functional  is  a  widely  used  energy  functional  for  diffuse 
interface  models  in  Euclidean  space.  In  physics,  given  a  region  of  particles,  the 
Ginzburg-Landau  functional  models  a  competition  between  two  phases  of  the  par¬ 
ticles.  The  minimizer  of  the  Ginzburg-Landau  functional  indicates  a  stable  status 
of  particles’  phases.  At  this  status,  the  region  is  separated  into  two  subgroups 
according  to  the  two  phases,  between  which  there  is  a  phase  transition  region 
(soft  boundary).  This  induced  separation  is  essentially  a  segmentation  of  the  re¬ 
gion  and  therefore  the  Ginzburg-Landau  functional  is  related  to  our  clustering 
problem. 


Figure  4.1:  Illustration  of  a  double  well  function  W(f )  =  /2(/  —  l)2  with  two 
equilibrium  points  at  zero  and  one. 

The  definition  of  the  Ginzburg-Landau  functional  in  Euclidean  space  is  given 
as: 

GL(f)  =  |  J \Vf\2dx  +  1  J  W(f)dx ,  (4.2) 
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where  W(f)  is  a  double  well  potential  function  satisfying  W(f)  =  /2(/  — l)2.  The 
goal  is  to  find  a  minimizer  /*  of  GL(f).  As  illustrated  in  Figure  4.1,  the  double 
well  functional  f  W(f)dx  has  two  minimizers  at  /  =  0  and  /  =  1.  Therefore,  by 
minimizing  the  double  well  function,  the  value  of  /  is  enforced  to  move  towards 
either  zero  or  one.  Thus  a  soft  phase  separation  is  induced.  The  first  term 
f  \V f\2dx  is  a  regularizer,  upon  minimizing  which  the  smoothness  and  regularity 
of  the  function  /  is  favored.  The  small  parameter  e  represents  the  diffuse  interface 
spacial  scale,  which  relates  to  how  wide  the  transition  region  is. 

The  differentiability  of  the  Ginzburg-Landau  functional  allows  simple  algo¬ 
rithms  for  minimization.  To  minimizer  the  energy  GL(f),  one  can  use  a  gradient 
descent  method  in  the  L-2  sense: 

fr=~GL(f)  =  eAf--Wy),  (4.3) 

where  the  quantity  r  is  the  time  evolving  parameter.  Note  that  the  PDE  equa¬ 
tion  (4.3),  the  L2  gradient  flow  of  the  Ginzburg-Landau  energy,  is  the  Allcn-Cahn 
equation.  According  to  [48],  the  Allcn-Cahn  equation  converges  to  motion  by 
mean  curvature,  which  is  a  gradient  flow  of  the  total  variation  semi-norm.  The 
Ginzburg-Landau  functional  has  been  used  as  an  alternative  for  the  total  vari¬ 
ational  semi-norm  in  image  processing  (see  the  references  in  Ref.  [8])  due  to  its 
T-convergence  to  the  TV  of  characteristic  functions  in  Euclidean  space  [52],  Be¬ 
cause  the  Ginzburg-Landau  functional  is  not  convex,  one  can  only  achieve  local 
minima  using  the  gradient  descent  method. 

4.2.2  MBO  scheme 

An  alternative  approach  to  approximating  motion  by  mean  curvature  of  an  in¬ 
terface  in  Euclidean  space  is  introduced  by  the  authors  of  [62,63]:  an  efficient 
algorithm  known  as  the  Merriman-Bence-Osher  (MBO)  scheme  using  threshold 
dynamics.  The  general  procedure  of  the  MBO  scheme  alternates  between  solving 
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a  linear  heat  equation  and  thresholding.  Because  motion  by  mean  curvature  is  a 
gradient  flow  of  the  TV  semi-norm,  the  goal  of  the  MBO  scheme  is  to  approxi¬ 
mately  solve: 

^  =  -||/Itv  =  -|/|V/Mx.  (4.4) 

An  L2  relaxation  of  the  equation  (4.4)  is: 

%=-j,S  |v/|2&=A/-  <4'5> 

which  is  the  heat  equation.  The  equation  (4.5)  is  a  tight  approximation  when  /  is 
close  to  a  indicator  function,  because  |V/|  =  |V/|2  when  /  only  takes  value  zero 
or  one. 

Starting  from  the  indicator  function  f  =  Xa,  the  MBO  scheme  first  propagates 
/  according  to  the  heat  equation  (4.5)  by  a  small  time  step.  After  that,  the 
function  /  is  no  longer  a  binary  indicator  function  and  the  L2  relaxation  is  no 
longer  tight.  Therefore,  the  MBO  scheme  subsequently  performs  a  hard  threshold 
on  /  such  that  the  value  of  /  switches  to  one  if  /  >  0.5,  and  zero  otherwise. 
Consequently,  /  becomes  an  indicator  function  again,  with  the  corresponding 
subset  modified  slightly. 

In  summary,  the  MBO  scheme  alternates  between  the  two  steps:  propagating  / 
according  to  the  heat  equation  by  a  small  time  step;  and  thresholding  /  to  become 
an  indicator  function.  Empirically,  the  MBO  scheme  converges  very  quickly,  and 
the  computation  cost  for  each  iteration  is  low  because  the  thresholding  step  is 
a  simple  operation.  Mathematical  analysis  has  been  explored  to  understand  the 
theoretical  connection  between  the  MBO  scheme  and  motion  by  mean  curvature. 
It  turns  out  the  MBO  scheme  can  be  proven  to  converge  to  motion  by  mean 
curvature,  in  the  sense  of  T- convergence.  See  references  [5,30,31]  for  discussions 
of  the  convergence  of  the  MBO  scheme. 

The  MBO  scheme  has  a  close  connection  to  minimizing  the  Ginzburg-Landau 
functional  with  a  gradient  descent  (4.3).  After  all,  they  are  both  approximating 
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motion  by  mean  curvature,  i.e.  a  gradient  flow  of  the  total  variation  semi-norm. 
One  interpretation  of  the  connection  is  that  the  MBO  scheme  replaces  the  non¬ 
linear  term  of  the  Allen- Cahn  equation  with  thresholding  [29].  The  non-linear 
term  ^W'(f)  in  (4.3)  is  induced  by  the  double  well  function,  which  enforces  the 
minimizer  /*  to  take  binary  values,  and  thus  poses  a  soft  constraint  for  /  to  be 
close  to  an  indicator  function.  In  the  MBO  scheme  however,  instead  of  using  the 
double  well  term,  the  thresholding  serves  as  a  hard  constraint  to  enforce  /  to  be 
an  exact  indicator  function  at  the  end  of  each  iteration. 


4.3  Reformulation  of  Modularity  Optimization 

In  this  subsection,  we  reformulate  the  problem  of  modularity  optimization  by 
deriving  a  new  expression  for  Q  that  bridges  the  network-science  and  compressive- 
sensing  communities.  This  formula  makes  it  possible  to  use  techniques  from  the 
latter  to  tackle  the  modularity-optimization  problem  with  low  computational  cost. 

We  start  by  recalling  the  total  variation  (TV)  and  defining  a  weighted  Z^-norm, 
and  weighted  mean  of  a  graph  function  /  :  G  — >  R: 

1  N 

I /| tv  :=  x  Wij  -  f{nj) |  , 

i,j= 1 

N 

ll/lltv=E<*.l/WI2.  («) 

2=1 

1  N 

mean (/)  :=  —  d*/ (n* ) . 

2=1 

For  a  vector- valued  function  /  =  (/1?  /2, . . . ,  /„,):  G  — »  Mn,  we  define 

h 

\f\rv  '■=  ^  |/z|rv  > 

i=i 

h 

ii/iiL  :=y  ii/iiiv  (4.7) 

i=i 


42 


and  mean(/)  :=  (mean(/i),  mean(/2), . . . ,  mean(/^)). 

Given  a  partition  g  =  {gi}(Li  defined  in  Section  2.2.1,  let  Ai  =  {rti  G  G,gi  =  l}, 
where  /  G  {1,  2, ... ,  ft}  (ft  <  N ).  Thus,  G  =  U”=1A;  is  a  partition  of  the  network 
(G,  E )  into  disjoint  communities.  Note  that  every  Ai  is  allowed  to  be  empty,  so 
g  is  a  partition  into  at  most  n  communities.  Let  fi  :  G  — >  {0, 1}  be  the  indicator 
function  \At  of  community  /;  in  other  words,  fi{rij)  equals  one  if  r/,:  =  l ,  and  it 
equals  zero  otherwise.  The  function  /  =  (/]_,  /2,  •  •  • ,  fh)  is  then  called  the  partition 
function  (associated  with  g).  Because  each  set  Ai  is  disjoint  from  all  of  the  others, 
it  is  guaranteed  that  only  a  single  entry  of  /(  equals  one  for  any  node  n*.  Therefore, 
/TgPcK",  where 

yft:={(l,0,...,0),(0,l,0,...,0),...,(0,...,0,l)}  =  {el}?=1 

is  the  standard  basis  of  Mn.  It  is  equivalent  to  /  G  B. 

The  key  observation  that  bridges  the  network-science  and  compressive-sensing 
communities  is  the  following: 

Theorem  1.  Maximizing  the  modularity  functional  Q  over  all  partitions  that  have 
at  most  h  communities  is  equivalent  to  minimizing 

\f\rv  —  l\\f  —  mean(/)|||2  (4.8) 

over  all  functions  f  :  G  —>•  Vn . 

Proof.  In  the  language  of  graph  partitioning,  vol(kb)  =  dt  denotes  the 

volume  of  the  set  Ai ,  and  Cut(A;,74j:)  =  ^n,eA(  njeAc  wii  is  graph  cuf  °f  Ai 
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and  Af.  Therefore, 


=  2m 


(2m  -  V  V  I  V  i<ij 

l= 1  \  22j  G-Aj  ,72^  E-Aj 


1 


= 1  -  ^  ( E  Cut^ ^  E  vol(^ 


,  Z=1 


7_ 

2m 


z=i 


=  i-7-  (  ^Cut(A,,A?)  -  E(Evo1(Az)  'vo1(^))  )  ,  (4-9) 


2  m 


2m 


where  the  sum  Ylgi^gj  wij  includes  both  Wij  and  Wji.  As  mentioned  in  Section 
2.2.2,  one  has  IxaItv  =  Cut  (A,  Ac).  Additionally,  one  can  observe  that: 


N 


Wxa  -  mean(xA)Hi2  =  E  di 


2=1 


XaK)  - 


vol(A) 


2  m 


=  vol(A)  1  - 


vol  (A) 
2m 

vol(A)  •  vol  (Ac) 

2m 


+  vol  (Ac) 


vol  (A) 
2m 


(4.10) 


Because  is  the  indicator  function  of  A;,  it  follows  that: 

fl 

l/lrv  -  711/  -  mean(/)||^2  =  {\fi\rv  -  l\\fi  ~  mean(/,)||f,2} 


i=i 


^{c^,^)-7VO'W2mV°'W}.  (4.H) 

Z=1  ^  J 

Combining  (4.9)  and  (4.11),  we  conclude  that  maximizing  Q  is  equivalent  to 
minimizing  (4.8).  □ 


With  the  above  argument,  we  have  reformulated  the  problem  of  modularity 
maximization  as  the  minimization  problem  (4.8),  which  corresponds  to  minimizing 
the  total  variation  (TV)  of  the  function  /  along  with  a  balance  term.  This  yields 
a  novel  view  of  modularity  optimization  that  uses  the  perspective  of  compressive 
sensing  (see  the  references  in  [57]).  In  the  context  of  compressive  sensing,  one 
seeks  a  solution  of  function  /  that  is  compressible  under  the  transform  of  a  linear 
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operator  <f>.  That  is,  we  want  $/  to  be  well-approximated  by  sparse  functions.  (A 
function  is  considered  to  be  “sparse”  when  it  is  equal  to  or  approximately  equal 
to  zero  on  a  “large”  portion  of  the  whole  domain.)  Minimizing  H^/H^  promotes 
sparsity  in  $/.  When  $  is  the  gradient  operator  (on  a  continuous  domain)  or  the 
finite-differencing  operator  (on  a  discrete  domain)  V,  then  the  object  H^/H^  = 
||V/||Ll  becomes  the  total  variation  \f\rv  [57,66].  The  minimization  of  TV  is  also 
common  in  image  processing  and  computer  vision  [21,57,66,80]. 

The  expression  in  equation  (4.9)  is  interesting  because  its  geometric  inter¬ 
pretation  of  modularity  optimization  contrasts  with  existing  interpretations  (e.g., 
probabilistic  ones  or  in  terms  of  the  Potts  model  from  statistical  physics  [68,76]). 
For  example,  we  see  from  (4.9)  that  finding  the  bipartition  of  the  graph  G  =  AuAc 
with  maximal  modularity  is  equivalent  to  minimizing 


Cut  (A,  Ac)  -  -4— vol(A)  •  vol  ( Ac )  . 

2m 

Note  that  the  term  vol(A)  •  vol  ( Ac )  is  maximal  when  vol(A)  =  vol  ( Ac )  =  m. 
Therefore,  the  second  term  is  a  balance  term  that  favors  a  partition  of  the  graph 
into  two  groups  of  roughly  equal  size.  In  contrast,  the  first  term  favors  a  partition 
of  the  graph  in  which  few  links  are  severed.  This  is  reminiscent  of  the  normalized 
cuts  problem  (Section  2.2.1)  in  which  the  objective  is  to  minimize  the  ratio 


Cut  (A,  Ac 


(4.12) 


vol  (A)  •  vol  (Ac) 

In  recent  papers  Refs.  [14,16,17,43,44,77],  various  TV-based  algorithms  are  pro¬ 
posed  to  minimize  ratios  similar  to  (4.12). 


4.4  Algorithm 

Directly  optimizing  (4.8)  over  all  partition  functions  /  :  G  — >  Vn  is  difficult  due 
to  the  discrete  solution  space.  A  continuous  relaxation  is  thus  needed  to  simplify 
the  optimization  problem. 
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4.4.1  Ginzburg-Landau  relaxation  of  the  discrete  problem 

Recall  that 

B  =  {/  I  /  :  G  -+  Vh} 

denotes  the  space  of  partition  functions.  Minimizing  (4.8)  over  B  is  equivalent  to 
minimizing 

j  \f\rv  —  t||/  —  niean(/)||/2 ,  if/ Gl 

H(J)  =  \  f4-14) 

I  +cx) ,  otherwise 

over  all  /  :  G  — >  Mn. 

As  discussed  in  Section  4.2.1,  the  Ginzburg-Landau  (GL)  functional  has  been 
used  as  an  alternative  for  the  TV  term  in  image  processing.  Reference  [8]  de¬ 
veloped  a  graph  version  of  the  GL  functional  and  used  it  for  graph-based  high¬ 
dimensional  data  segmentation  problems.  The  authors  of  reference  [35, 59]  gener¬ 
alized  the  two-phase  graphical  GL  functional  to  a  multi-phase  one. 

Following  the  idea  in  references  [8,35,59],  we  define  the  graph  Ginzburg-Landau 
relaxation  of  H  as  follows: 

n  N 

He{f)  =  2  5^</i,  L-fi )  +  ~2  (/K))  -  711/  -  mean(/)||i2 ,  (4.14) 

where  e  >  0.  In  equation  (4.14),  l'Fmuiti  :  — >  M  is  a  multi-well  potential  (see 
[35,59])  with  equal-depth  wells.  The  minima  of  IL'Vuiti  are  spaced  equidistantly, 
take  the  value  zero,  and  correspond  to  the  points  of  Vn .  The  specific  formula  for 
l'Fmuiti  does  not  matter  for  this  work,  because  we  will  discard  it  when  we  implement 
the  MBO  scheme.  Note  that  the  purpose  of  this  multi-well  term  is  to  force  /(nj) 
to  go  to  one  of  the  minima,  so  that  one  obtains  an  approximate  phase  separation. 

Our  next  theorem  states  that  modularity  optimization  with  an  upper  bound  on 
the  number  of  communities  is  well- approximated  (in  terms  of  T-convergence)  by 
minimizing  He  over  all  /  :  G  — >  Mn.  Therefore,  the  discrete  modularity  optimiza¬ 
tion  problem  (4.8)  can  be  approximated  by  a  continuous  optimization  problem. 
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We  give  the  mathematical  definition  and  relevant  proofs  of  T -convergence  in  Sec¬ 
tion  4.5. 

Theorem  2  (T-convergence  of  He  towards  H).  The  functional  He  T -converges  to 
H  on  the  space  X  —  {/  |  /  :  G  — >•  Mn}. 

Proof.  As  shown  in  Theorem  3  (in  Section  4.5),  Hf  +  7 1 1 / — mean ( f ) |||  T-converges 
to  H  +  7||/  —  mean(/)|||  on  X.  Because  7|| /  —  mean(/)|||  is  continuous  on  the 
metric  space  X,  it  is  straightforward  to  check  that  He  T-converges  to  H  according 
to  the  definition  of  T-convergence.  □ 

By  the  definition  of  T-convergence,  Theorem  2  directly  implies  the  following: 

Corollary  1.  Let  fe  be  the  global  minimizer  of  He.  Any  convergent  subsequence 
of  fe  then  converges  to  a  global  maximizer  of  the  modularity  Q  with  at  most  h 
communities. 

4.4.2  MBO  scheme,  convex  splitting,  and  spectral  approximation 

In  this  subsection,  we  use  techniques  from  the  compressive-sensing  and  image- 
processing  literatures  to  develop  an  efficient  algorithm  that  (approximately)  opti¬ 
mizes  If . 

In  Section  4.2.2,  we  discussed  a  fast  variational  method  called  the  MBO 
scheme.  Inspired  by  the  MBO  scheme,  the  authors  of  reference  [29]  developed 
a  method  using  a  PDE  framework  to  minimize  the  piecewise-constant  Mumford- 
Shah  functional  (introduced  in  [65])  for  image  segmentation.  Their  algorithm  was 
motivated  by  the  Chan-Vese  level-set  method  [21]  for  minimizing  certain  variants 
of  the  Mumford-Shah  functional.  Note  that  the  Chan-Vese  method  is  related  to 
our  reformulation  of  modularity,  because  it  uses  the  TV  as  a  regularizer  along 
with  L2  based  fitting  terms.  The  authors  of  references  [35,  59,  60]  applied  the 
MBO  scheme  to  graph-based  problems. 
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The  gradient-descent  equation  of  (4.14)  is 

^  =  — (L./i,  •  •  • ,  L /a)  -  ^  (7|| /  -  mean(/)||U  ,  (4.15) 

where  :  G  — *  M"  is  the  composition  of  the  functions  J^Wmuiti  and 

/.  Thus,  one  can  follow  the  idea  of  the  original  MBO  scheme  to  split  (4.15) 
into  two  parts  and  replace  the  forcing  part  ^  =  —  ^^bFmuiti(/)  by  an  associated 
thresholding. 

We  propose  a  Modularity  MBO  scheme  that  alternates  between  the  following 
two  primary  steps  to  obtain  an  approximate  solution  fn  :  G  — >  Vn : 

Step  1. 

A  gradient-descent  process  of  temporal  evolution  consists  of  a  diffusion  term 
and  an  additional  balance  term: 

^  =  — (L/i,  •  •  • ,  L fn)  +  ^  (7II/  -  mean(/)|||2)  .  (4.16) 

We  apply  this  process  on  fn  with  time  step  rn,  and  we  repeat  it  for  r/  time 
steps  to  obtain  /. 

Step  2. 

We  threshold  /  from  Rn  into  Vn : 

/n+1(ni)  =  4  G  Vn  ,  where  gt  =  argma x{1<,<ft}{/{(ni)}  • 

This  step  assigns  to  /n+1(nj)  the  node  in  that  is  the  closest  to 

To  solve  (4.16),  we  implement  a  convex- splitting  scheme  [32,94],  Equation 
(4.16)  is  the  gradient  flow  of  the  energy  Hi  +  H2,  where  Hi(f )  :=  \  Y^i=\ (./),  L /)) 
is  convex  and  H2(f)  :=  — 7||/  — mean(/)|||,)  is  concave.  In  a  discrete-time  stepping 
scheme,  the  convex  part  is  treated  implicitly  in  the  numerical  scheme,  whereas 
the  concave  part  is  treated  explicitly.  Note  that  the  convex-splitting  scheme  for 
gradient- descent  equations  is  an  unconditionally  stable  time-stepping  scheme. 
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The  discretized  time-stepping  formula  is 


f  —  fn  d  „  d 

-  TfH2{fn) 

Tn  dj  df 

=  — (L/l,  ....  L/ft)  +  27d  0  (/"  -  mean(D) ,  (4.17) 

where  (d  0  /)(rq)  :=  d*/(nj),  f  :  G  Mn,  (d*  is  the  strength  of  node  77 ),  and 
/n  :  G  — »  Vn.  At  each  step,  we  thus  need  to  solve 

((1  +  TnL)/i,  •  •  • ,  (1  +  r„L)/n)  =  fn  +  ‘I'fTnd®  [/"  -  mean(/n)]  .  (4.18) 

For  the  purpose  of  computational  efficiency,  we  utilize  the  low-order  (leading) 
eigenvectors  (associated  with  the  smallest  eigenvalues)  of  the  graph  Laplacian  L 
to  approximate  the  operator  L.  The  eigenvectors  with  higher  order  are  more 
oscillatory,  and  resolve  finer  scale.  Leading  eigenvectors  provide  a  set  of  basis 
to  approximately  represent  graph  functions.  The  more  leading  eigenvectors  are 
used,  the  finer  scales  can  be  resolved.  In  the  graph-clustering  literature,  scholars 
usually  use  a  small  portion  of  leading  eigenvectors  of  L  to  find  useful  structural 
information  in  a  graph  [7,22,24,68,81],  (note  however  that  some  recent  work  has 
explored  the  use  of  other  eigenvectors  [25]).  In  contrast,  one  typically  uses  many 
more  modes  when  solving  partial  differential  equations  numerically  (e.g.,  consider 
a  psuedospectral  scheme),  because  one  needs  to  resolve  the  solution  at  much  finer 
scales. 

Motivated  by  the  known  utility  and  many  successes  of  using  leading  eigenvec¬ 
tors  (and  discarding  higher-order  eigenvectors)  in  studying  graph  structure,  we 
project  /  onto  the  space  of  the  Ne ig  leading  eigenvectors  to  approximately  solve 
(4.18).  Assume  that  fn  =  ]CS  and  271yd©  (fn  -  mean(/n))  = 

where  {As}  are  the  A^ig  smallest  eigenvalues  of  the  graph  Laplacian  L. 
We  denote  the  corresponding  eigenvectors  (eigenfunctions)  by  {0S}.  Note  that 
a”,  as,  and  b"  all  belong  to  Mn.  With  this  representation,  we  obtain 

„n  1  un 

*  =  SG {1,2,..., IVeiJ  (4.19) 

1  +  TnXs 
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from  (4.18)  and  are  able  to  solve  (4.18)  more  efficiently. 


Algorithm  1  The  Modularity  MBO  scheme. 

Set  values  for  7,  h,  rj,  and  rn  =  dt. 

Input:  G-  an  initial  function  /°  :  G  — >  Vn  and  the  eigenvalue-eigenvector 
pairs  {(AS,0S)}  of  the  graph  Laplacian  L  corresponding  to  the  Aeig  smallest 
eigenvalues. 

Initialization:  for  n  =  0 

a  °a  =  (f°M\ 

while  fn  ^  /n_1  and  n  <  500:  do 
Diffusion: 
for  i  —  1  — >■  r)  do 

b”  =  (2y dtdO  (fn  -  mean (fn)),  0S); 

for  . ^ 

i=i+l; 

end  for 

Thresholding: 

fn+1(rii)  =  egi  G  Vn ,  where  gt  =  argma x{1<j<ft}{//(ni)}- 

a?+1  =  (/n+1,0.); 

n  =  n  +  1; 

end  while 

Output  the  partition  function  fn. 


We  summarize  our  Modularity  MBO  scheme  in  Algorithm  1.  Note  that  the 
time  complexity  of  each  MBO  iteration  step  is  O(N).  The  choice  of  the  initial 
function  f°  can  have  significant  impact  on  the  outcome  of  the  algorithm,  because 
the  modularity  function  is  not  convex  and  different  initiation  may  lead  to  different 
local  minima.  Unless  specified  otherwise,  the  numerical  experiments  in  this  work 
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using  a  random  initial  function  f°. 


4.4.3  Two  implementations  of  the  Modularity  MBO  scheme 

Given  an  input  value  of  the  parameter  n,  the  Modularity  MBO  scheme  partitions 
a  graph  into  at  most  n  communities.  In  many  applications,  however,  the  number 
of  communities  is  usually  not  known  in  advance  [33,76],  so  it  can  be  difficult  to 
decide  what  values  of  n  to  use.  Accordingly,  we  propose  two  implementations  of 
the  Modularity  MBO  scheme.  The  Recursive  Modularity  MBO  (RMM)  scheme 
is  particularly  suitable  for  networks  that  one  expects  a  large  number  of  commu¬ 
nities,  whereas  the  Multiple  Input-n  Modularity  MBO  (Multi-n  MM)  scheme  is 
particularly  suitable  for  networks  that  one  expects  to  have  a  small  number  of 
communities. 

Implementation  1.  The  RMM  scheme  performs  the  Modularity  MBO 
scheme  recursively,  which  is  particular  suitable  for  networks  that  one  expects 
to  have  a  large  number  of  communities.  In  practice,  we  set  the  value  of  n  to  be 
large  in  the  first  round  of  applying  the  scheme,  and  we  then  let  it  be  small  for  the 
rest  of  the  recursion  steps.  In  the  experiments  that  we  report  in  this  work,  we  use 
n  =  50  for  the  first  round  and  n  =  min(10, 151)  thereafter,  where  |5j  is  the  size  of 
the  subnetwork  that  one  is  partitioning  in  a  given  step.  (We  also  tried  n  =  10,  20 
or  30  for  the  first  round  and  ii  =  min(10,  |5|)  thereafter.  The  results  are  similar.) 

Importantly,  the  minimization  problem  (4.8)  needs  a  slight  adjustment  for 
the  recursion  steps.  Assume  for  a  particular  recursion  step  that  we  perform  the 
Modularity  MBO  partitioning  with  parameter  ii  on  a  network  S  C  G  containing 
a  subset  of  the  nodes  of  the  original  graph.  Our  goal  is  to  increase  the  modularity 
for  the  global  network  instead  of  the  subnetwork  S.  Hence,  the  target  energy  to 
minimize  is 

(5) 

H(S\f)  :=  |/|$.  -  7— —  ||  /  -  mean(5)(/)|$  , 

TYl 
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where  /  :  S  — >  Vn  C  Mn,  the  TV  norm  |  •  is  defined  as: 

\f\rv  '■=  \  wv\f(ni)  ~  /K')U i ; 

ili,nj£S 

the  total  edge  weight  of  S  is  2 =  J2n.&sdi,  and 

mean (5)(/)  =  ^  dJim) . 

rii&S 

The  rest  of  the  minimization  procedures  are  the  same  as  described  previously. 

Note  that  this  recursive  scheme  is  adaptive  in  resolving  the  network  structure 
scale.  The  eigenvectors  of  the  subgroups  are  recalculated  at  each  recursive  step, 
so  the  scales  being  resolved  get  finer  as  the  recursion  step  goes.  Therefore  Neig 
need  not  to  be  very  large. 

Implementation  2.  For  the  Multi-n  MM  scheme,  one  sets  a  search  range 
T  for  ii,  runs  the  Modularity  MBO  scheme  for  each  n  €  T,  and  then  chooses  the 
resulting  partition  with  the  highest  modularity  score.  It  works  well  if  one  knows 
the  approximate  maximum  number  of  communities  and  that  number  is  reasonably 
small.  One  can  then  set  the  search  range  T  to  be  all  integers  between  two  and  the 
maximum  number.  Even  though  the  Multi-n  MM  scheme  allows  partitions  with 
fewer  than  n  clusters,  it  is  still  necessary  to  include  small  values  of  n  in  the  search 
range  to  better  avoid  local  minimums.  (See  the  discussion  of  the  MNIST  “4-9” 
digits  network  in  Section  4. 6. 2.1.)  For  different  values  of  h,  one  can  reuse  the 
previously  computed  eigenvectors  because  n  does  not  affect  the  graph  Laplacian. 
Inputting  multiple  choices  for  the  random  initial  function  f°  (as  described  at  the 
end  of  Section  4.4)  also  helps  to  reduce  the  chance  of  getting  stuck  in  a  minimum 
and  thereby  to  achieve  a  good  optimal  solution  for  the  Modularity  MBO  scheme. 
Because  this  initial  function  is  used  after  the  computation  of  eigenvectors,  it  only 
takes  a  small  amount  of  time  to  rerun  the  MBO  steps.  In  Section  4.6,  we  test 
these  two  schemes  on  several  real  and  synthetic  networks. 
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4.5  T-convergence 


The  notion  of  T-convergence  of  functionals  is  now  commonly  used  for  minimization 
problems.  See  reference  [26]  for  detailed  introduction.  In  this  section,  we  briefly 
review  the  definition  of  T-convergence  and  then  prove  the  claim  that  the  graphical 
multi-phase  Ginzburg-Landau  functional  T-converges  to  the  graph  TV.  This  proof 
is  a  straightforward  extension  of  the  work  in  [89]  for  the  two-phase  graph  GL 
functional. 

Definition  1.  Let  X  be  a  metric  space  and  let  {Fn  :T->lU  {±cx)}}^1  be  a 
sequence  of  functionals.  The  sequence  Fn  T-converges  to  the  functional  F  :  X  — > 
R  U  {±oo}  if,  for  all  f  e  X,  the  following  lower  and  upper  bound  conditions  hold: 


(lower  hound  condition)  for  every  sequence  {fn}ff=i  such  that  fn  — »  f,  we 
have 

F(f)  <  lirninf  Fn(fn) ; 

n— 

(upper  bound  condition)  there  exists  a  sequence  {fn}™=i  such  that 

F{f )  >  limsup  Fn(fn) . 

n— >oo 


Reference  [35, 59]  proposed  the  following  multi-phase  graph  GL  functional: 

1  "  1  N 

GLr"v)  =  5  T.F  Vi>  +  3  E  «'—(/(».)) 

Z  1=1  i= 1 

where  /  :  G  ->•  Mn  and  Wmvati(f(rii))  =  n”=i  II  f(ni)  ~  gIIL- 

Let  X  =  {/  |  /  :  G  — >  Mn}  and  Fe  =  GL°lultl  for  all  e  >  0,  (we  have 
B  =  {/  |  /  :  G  >  Vn }  C  A").  Because  /  can  be  viewed  as  a  matrix  in  MArxn,  the 
metric  for  space  X  can  be  defined  naturally  using  the  L2  norm. 
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Theorem  3.  (  Y- convergence).  The  sequence  Fe  Y -converges  to  F0  as  e  — »  0+ 
where 


Fo(f) 


ES= i  Wij\f{ni)  ~  /K)|li  ,  if  /  e  B  , 

otherwise . 


Proof.  Consider  the  functional  We(f)  =  hFmuiti(/(^i))  an(l 


W0(f)  :  = 


if  /  el, 
otherwise . 


First,  we  show  that  We  T-converges  to  Wo  as  e  — »  0+.  Let  {en}^=1  C  (0,  oo) 
be  a  sequence  such  that  en  — >  0  as  n  — >  oo.  For  the  lower  bound  condition, 
suppose  that  a  sequence  {fn}™=i  satishes  fn  — *  /  as  n  — *  oo.  If  /  G  B,  then 
it  follows  that  Wo(f)  =  0  <  lim  inf^oo  W£n  (fn)  because  We  >0.  If  /  does 
not  belong  to  B,  then  there  exists  i  G  {1,2,. . .  ,N}  such  that  ffui)  Vn  and 
fn(rii )  ->•  finf).  Therefore,  liminf^oo  Wen(fn)  =  +oo  >  W0(f )  =  +oo.  For  the 
upper  bound  condition,  assume  that  /  G  B  and  fn  =  /  for  all  n.  It  then  follows 
that  W0(f)  =  0  >  lirnsup^^  Wen(fn )  =  0.  Thus,  We  T-converges  to  W0. 

Because  Z(f)  :=  \  Y^i=i (/i-L/i)  is  continuous  on  the  metric  space  X,  it  is 
straightforward  to  check  that  the  functional  Ftn  =  Z  +  W£n  satishes  the  lower  and 
upper  bound  condition  and  therefore  T-converges  to  Z  +  Wq. 

Finally,  note  that  Z(f)  =  \f\rv  for  all  /  G  B.  Therefore,  Z  +  W0  =  F0  and 
one  can  conclude  that  F€n  T-converges  to  F0  for  any  sequence  en  — »  0+.  □ 


Theorem  4.  (Compactness).  Let  {en},c^=1  C  R+  be  a  sequence  such  that  en  — >  0  as 
n  — >■  oo.  {fn})f=1  C  X  is  a  sequence  such  that  Fen(fn )  is  uniformly  bounded.  Then 
there  exists  a  subsequence  {f  n'}(f=1  C  {/n}^L1  and  f°°  G  B  such  that  fn'  -G  /°° 
as  n  — >  oo. 
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Proof.  Since  Fen(fn )  is  uniformly  bounded,  we  have 

N 

T  Wmuiti{fn{rii ))  <  Cen,  V  n; 

i=  1 

for  some  constant  C.  Hence  for  large  enough  n,  {fn}™=i  is  bounded  and  therefore 
has  a  convergent  subsequence  {fn'}^l=1  with  limit  f°°. 

WmuHi  is  continuous  =>  Wmuiti(fn'  (nf))  converges  to  Y^?=i  Wmuiti(f°°(ni )) 
=>  Eil  Wmuitiif^irii))  =  0.  Therefore  /°°6  B.  □ 

Since  {/  G  X  :  Fe(/)  <  C}  is  bounded  for  any  fixed  e  and  C,  Theorem  4  proves 
the  equi- coerciveness  of  the  sequence  Ftn.  Combined  with  the  T-convergence,  one 
can  conclude  the  following  theorem  stated  in  Chapter  7  of  [26]. 

Theorem  5.  Let  X  be  a  metric  space  and  {Fn  :  X  — >•  .RUlioo}}^  be  a  sequence 
of  equi- coerciveness  functionals.  Fn  V -converges  to  F  :  X  — >•  RU  {±cxo}.  Then 
there  exists  a  minimizer  of  F  in  X  and  min{F(f )  :  /  e  X}  =  lim  inf,W0C{Fn(/)  : 
/  G  A"}.  Furthermore,  if  {fn}^=1  C  X  is  a  precompact  sequence  such  that 

lim  Fn(fn )  =  lim  inf {Fn(f)  :  f  G  X}, 

n^oo  n— >oc 

then  every  cluster  point  of  this  sequence  is  a  minimizer  of  F. 

4.6  Numerical  Results 

In  this  section,  we  present  the  numerical  results  of  experiments  that  we  conducted 
using  both  synthetic  and  real  network  data  sets.  Unless  otherwise  specified,  our 
Modularity  MBO  schemes  are  all  implemented  in  Matlab,  (which  are  not  opti¬ 
mized  for  speed).  In  the  following  tests,  we  set  the  parameters  of  the  Modularity 
MBO  scheme  to  be  rj  —  5  and  rn  =  1. 
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4.6.1  LFR  benchmark 


In  Ref.  [54],  Lancichinetti,  Fortunato,  and  Radicchi  (LFR)  introduced  an  epony¬ 
mous  class  of  synthetic  benchmark  graphs  to  provide  tougher  tests  of  community- 
detection  algorithms  than  previous  synthetic  benchmarks.  Many  real  networks 
have  heterogeneous  distributions  of  node  degree  and  community  size,  so  the  LFR 
benchmark  graphs  incorporate  such  heterogeneity.  They  consist  of  unweighted 
networks  with  a  predefined  set  of  non-overlapping  communities.  As  described  in 
Ref.  [54] ,  each  node  is  assigned  a  degree  from  a  power-law  distribution  with  power 
£;  additionally,  the  maximum  degree  is  given  by  dmax  and  mean  degree  is  (d). 
Community  sizes  in  LFR  graphs  follow  a  power-law  distribution  with  power  /3, 
subject  to  the  constraint  that  the  sum  of  the  community  sizes  must  equal  the 
number  of  nodes  N  in  the  network.  Each  node  shares  a  fraction  1  —  p,  of  its  edges 
with  nodes  in  its  own  community  and  a  fraction  p  of  its  edges  with  nodes  in  other 
communities.  (The  quantity  p  is  called  the  mixing  parameter .)  The  minimum 
and  maximum  community  sizes,  gmin  and  gmax,  are  also  specified.  We  label  the 
LFR  benchmark  data  sets  by  (. N ,  (d),  dmax,  £,  /?,  p,  gmin,  ?max)-  The  code  used  to 
generate  the  LFR  data  is  publicly  available  provided  by  the  authors  in  [54], 

The  LFR  benchmark  graphs  has  become  a  popular  choice  for  testing  com¬ 
munity  detection-algorithms,  and  Ref.  [53]  uses  them  to  test  the  performance  of 
several  community-detection  algorithms.  The  authors  concluded,  for  example, 
that  the  locally  greedy  Louvain  algorithm  [9]  is  one  of  the  best  heuristics  for  max¬ 
imizing  modularity  based  on  the  evaluation  of  the  normalized  mutual  information 
(NMI)  (discussed  below  in  this  section).  Note  that  the  time  complexity  of  this 
Louvain  algorithm  is  0(M )  [33],  where  M  is  the  number  of  nonzero  edges  in  the 
network.  In  our  tests,  we  use  the  GenLouvain  code  (in  Matlab)  from  Ref.  [50]; 
this  is  an  implementation  of  a  Louvain-like  algorithm.  The  GenLouvain  code  is 
a  modification  of  the  Louvain  locally  greedy  algorithm  [9]  (introduced  in  Section 
4.1)  so  that  it  applies  to  the  cases  with  arbitrary  values  of  resolution  parameter, 
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but  it  was  not  optimized  for  speed.  We  implement  our  RMM  scheme  on  the  LFR 
benchmark,  and  we  compare  our  results  against  the  GenLouvain  code.  We  use 
the  recursive  version  of  the  Modularity  MBO  scheme  because  our  LFR  networks 
contain  about  0.041V  communities. 


We  implement  the  modularity-optimization  algorithms  on  severals  sets  of  LFR 
benchmark  data.  We  then  compare  the  resulting  partitions  with  the  known  com¬ 
munity  assignments  of  the  benchmarks  (i.e.,  the  ground  truth)  by  examining  the 
normalized  mutual  information  (NMI)  [27].  NMI  is  a  similarity  measure  for  com¬ 
paring  two  partitions  based  on  the  information  entropy,  and  it  is  often  used  for 
testing  community-detection  algorithms  [53,54],  The  NMI  equals  one  when  two 
partitions  are  identical,  and  it  has  an  expected  value  of  zero  when  they  are  inde¬ 
pendent.  For  an  N- node  network  with  two  partitions,  C  =  {Ci,  C'2, . . . ,  Cj< }  and 
C  =  {Ci,  C'2 , ... ,  Cfc\,  that  consist  of  non-overlapping  communities,  the  NMI  is 


NMI(C,  C)  = 


2  Ef=i  EL  p(ki  k)lo& 

P{k,k) 

_P(k)P(k)_ 

-  EfcLi  p(k)  los  ip(k )]  -  Ef=i  p(k)  i°g 

1 - 1 

1 _ 1 

(4.20) 


where  P(k,  k)  = 


|c‘nN  P(k)  =  Ifl,  and  P(k)  =  !^1. 


N  ’  V'V  N  ’  V'V  N 

We  examine  two  types  of  LFR  networks.  One  is  the  1000-node  ensembles  used 
in  Ref.  [53]: 

LFRlk  :  (1000, 20,  50,  2, 1,  /j,  10,  50) , 


where  11  G  {0.1,  0.15, ...,  0.8}.  The  other  is  a  50,000-node  network,  which  we 
call  “LFR50k”  and  construct  as  a  composition  of  50  LFRlk  networks.  (See  the 
detailed  description  below.) 


4. 6. 1.1  LFRlk  networks 

We  use  the  RMM  scheme  (with  Wig  =  80)  and  the  GenLouvain  code  on  en¬ 
sembles  of  LFRlk(1000,  20,  50,  2, 1,  /i,  10,  50)  graphs  with  mixing  parameters  /i  e 
{0.1,  0.15, ... ,  0.8}.  We  consider  100  LFRlk  networks  for  each  value  of  /i,  and  we 
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(a)  NMI  and  Modularity  ( Q ). 


RMM  Nc 

O  Genlouvain  Nc 
-0-GTNc _ 


(b)  Number  of  Communities  (Nc). 

Figure  4.2:  Tests  on  LFRlk  networks  with  RMM  and  GenLouvain.  The  ground- 
truth  communities  are  denoted  by  GT. 


use  a  resolution  parameter  of  7  =  1. 

In  Figure  4.2,  we  plot  the  mean  maximized  modularity  score  (Q),  the  number  of 
communities  (Nc),  and  the  NMI  of  the  partitions  compared  with  the  ground  truth 
(GT)  communities  as  a  function  of  the  mixing  parameter  /i.  As  one  can  see  from 
panel  (a),  the  RMM  scheme  performs  very  well  for  /i  <  0.5.  Both  its  NMI  score 
and  modularity  score  are  competitive  with  the  results  of  GenLouvain.  However, 
for  n  >  0.5,  its  performance  drops  with  respect  to  both  NMI  and  the  modularity 
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scores  of  its  network  partitions.  From  panel  (b),  we  see  that  RMM  tends  to  give 
partitions  with  more  communities  than  GenLouvain,  and  this  provides  a  better 
match  to  the  ground  truth.  However,  it  is  only  trustworthy  for  /i  <  0.5,  when  its 
NMI  score  is  very  close  to  one. 

The  mean  computational  time  for  one  ensemble  of  LFRlk,  which  includes  15 
networks  corresponding  to  15  values  of  n,  is  22.7  seconds  for  the  GenLouvain 
code  and  17.9  seconds  for  the  RMM  scheme.  As  we  will  see  later  when  we  con¬ 
sider  large  networks,  the  Modularity  MBO  scheme  scales  very  well  in  terms  of  its 
computational  time. 

4.6. 1.2  LFR50k  networks 

To  examine  the  performance  of  our  scheme  on  larger  networks,  we  construct  syn¬ 
thetic  networks  (LFR50k)  with  50,000  nodes.  To  construct  an  LFR50k  network, 
we  start  with  50  different  LFRlk  networks  N\ ,  N-2 ,  •  •  • ,  Wo  with  mixing  parameter 
fi,  and  we  connect  each  node  in  Ns  (s  E  {1,2,...,  50})  to  20/j  nodes  in  jVs+i  uni¬ 
formly  at  random  (where  we  note  that  Wi  =  N\).  We  thereby  obtain  an  LFR50k 
network  of  size  50,  000.  Each  community  in  the  original  Ns,  s  =  1,  2, . . . ,  50  is  a 
new  community  in  the  LFR50k  network.  We  build  four  such  LFR50k  networks 
for  each  value  of  /i  =  0.1,0.15, . . .  ,0.8,  and  we  find  that  all  such  networks  con¬ 
tain  about  2000  communities.  The  mixing  parameter  of  the  LFR50k  network 
constructed  from  LFRlk(yu)  is  approximately 

By  construction,  the  LFR50k  network  has  a  similar  structure  as  LFRlk.  Im¬ 
portantly,  simply  increasing  N  in  LFR(1V,  ( k ),  /cmax,  £,  /?,  /i,  gmin,  qWx)  to  50,000  is 
insufficient  to  preserve  similarity  of  the  network  structure.  A  large  N  results  in 
more  communities,  so  if  the  mixing  parameter  is  held  constant,  then  the  edges 
of  each  node  that  are  connected  to  nodes  outside  of  its  community  will  be  dis¬ 
tributed  more  sparsely.  In  other  words,  the  mixing  parameter  does  not  entirely 
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reflect  the  balance  between  a  node’s  connection  within  its  own  community  versus 
its  connections  to  other  communities,  as  there  is  also  a  dependence  on  the  total 
number  of  communities. 

The  distribution  of  node  strengths  in  LFR50k  is  scaled  approximately  by  a 
factor  of  (1  +  2 /i)  compared  to  LFRlk,  while  the  total  number  of  edges  in  LFR50k 
is  scaled  approximately  by  a  factor  of  50(1  +  2 /i).  Therefore,  the  probability  null 
model  term  in  modularity  (2.12)  is  also  scaled  by  a  factor  of  .  Hence, 
in  order  to  probe  LFR50k  with  a  resolution  scale  similar  to  that  in  LFRlk,  it  is 
reasonable  to  use  the  resolution  7  =  50  to  try  to  minimize  issues  with  modularity’s 
resolution  limit  [78].  We  then  implement  the  RMM  scheme  (Neig  =  100)  and 
the  GenLouvain  code.  Note  that  we  also  implemented  the  RMM  scheme  with 
Ne ig  =  500,  but  there  is  no  obvious  improvement  in  the  result  even  though  there 
are  about  2000  communities.  This  is  because  the  eigenvectors  of  the  subgroups 
are  recalculated  at  each  recursive  step,  so  the  scales  being  resolved  get  finer  as 
the  recursion  step  goes. 

We  average  the  network  diagnostics  over  the  four  LFR50k  networks  for  each 
value  of  mixing  parameter.  In  Fig.  4.3,  we  plot  the  network  diagnostics  versus  the 
mixing  parameter  for  /i  e  {0.1,0.15, . . .  ,0.8}.  In  panel  (a),  we  see  that  the 
performance  of  RMM  is  good  only  when  the  mixing  parameter  is  less  than  0.5, 
though  it  is  not  as  good  as  GenLouvain.  It  seems  that  the  recursive  Modularity 
MBO  scheme  has  some  difficulties  in  dealing  with  networks  with  very  large  number 
of  clusters. 

However  the  computational  time  of  RMM  is  lower  than  that  of  the  GenLouvain 
code  [50].  The  mean  computational  time  for  an  ensemble  of  LFR50k  networks, 
which  includes  15  networks  corresponding  to  15  values  of  /j,  is  690  seconds  for 
GenLouvain  and  220  seconds  for  the  RMM  scheme.  In  Table  4.1,  we  summarize 
the  mean  computational  time  (in  seconds)  on  each  ensemble  of  LFR  data. 
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Mixing  parameter 


(a)  NMI  and  Modularity  ( Q ). 


Mixing  parameter 


(b)  Number  of  Communities  (Nc). 

Figure  4.3:  Tests  on  LFR50k  data  with  RMM  and  GenLouvain. 


LFRlk 

LFR50k 

GenLouvain 

22.7  s 

690  s 

RMM 

17.9  s 

220  s 

Table  4.1:  Computational  time  on  LFR  data. 

4.6.2  MNIST  handwritten  digit  images 

The  MNIST  database  consists  of  70,000  images  of  size  28  x  28  pixels  containing 
the  handwritten  digits  “0”  through  “9”  [1].  The  digits  in  the  images  have  been 
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normalized  with  respect  to  size  and  centered  in  a  fixed-size  grey  image.  In  this 
section,  we  use  two  networks  from  this  database.  We  construct  one  network  using 
all  samples  of  the  digits  “4”  and  digit  “9” ,  which  are  difficult  to  distinguish  from 
each  other  and  which  constitute  13782  images  of  the  70000.  We  construct  the 
second  network  using  all  images.  In  each  case,  our  goal  is  to  separate  the  distinct 
digits  into  distinct  communities. 

We  construct  the  adjacency  matrices  (and  hence  the  graphs)  W  of  these  two 
data  sets  as  follows.  First,  we  project  each  image  (a  282-dimensional  datum)  onto 
50  principal  components.  For  each  pair  of  nodes  n,  and  n3  in  the  50-dimensional 
space,  we  then  let  wi3  =  exp  ^  j  if  rij  is  among  the  10  nearest  neighbors  of  n*; 
otherwise,  we  let  Wij  =  0.  To  make  W  a  symmetric  matrix,  we  take  W  =  w+w  . 
The  quantity  dij  is  the  L2  distance  between  nt  and  n3 ,  the  parameter  cq  is  the 
mean  of  distances  between  nt  and  its  10  nearest  neighbors. 

In  this  data  set,  the  maximum  number  of  communities  is  two  when  considering 
only  the  digits  “4”  and  “9”,  and  it  is  10  when  considering  all  digits.  We  can  thus 
choose  a  small  search  range  for  n  and  use  the  Multi-h  Modularity  MBO  scheme. 

4.6. 2.1  MNIST  “4-9”  digits  network 

This  weighted  network  has  13782  nodes  and  194816  weighted  edges.  We  use  the 
labeling  of  each  digit  image  as  the  ground  truth.  There  are  two  groups  of  nodes: 
ones  containing  the  digit  “4”  and  ones  containing  the  digit  “9” .  We  use  these  two 
digits  because  they  tend  to  look  very  similar  when  they  are  written  by  hand.  In 
Figure  4.4  (a),  we  show  a  visualization  of  this  network,  where  we  have  projected 
the  data  projected  onto  the  second  and  third  leading  eigenvectors  of  the  graph 
Laplacian  L.  The  difficulty  of  separating  the  “4”  and  “9”  digits  has  been  observed 
in  the  graph-partitioning  literature  (see,  e.g.,  Ref.  [44]).  For  example,  there  is  a 
near-optimal  partition  of  this  network  using  traditional  spectral  clustering  [81,95] 
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(see  below)  that  splits  both  the  “4”-group  and  the  “9” -group  roughly  in  half. 

The  modularity-optimization  algorithms  that  we  discuss  for  the  “4-9”  network 
use  7  =  0.1.  We  choose  this  resolution-parameter  value  so  that  the  network  is 
partitioned  into  two  groups  by  the  GenLouvain  code.  The  question  about  what 
value  of  7  to  choose  is  beyond  the  scope  of  this  thesis,  but  it  has  been  discussed 
at  some  length  in  the  literature  on  modularity  optimization  [33].  Instead,  we 
focus  on  evaluating  the  performance  of  our  algorithm  with  the  given  value  of  the 
resolution  parameter.  We  implement  the  Modularity  MBO  scheme  with  n  =  2  and 
the  Multi-h  MM  scheme,  and  we  compare  our  results  with  that  of  the  GenLouvain 
code  as  well  as  traditional  spectral  clustering  method  [81,95]. 

Traditional  spectral  clustering  is  an  efficient  clustering  method  that  has  been 
used  widely  in  computer  science  and  applied  mathematics  because  of  its  simplicity. 
It  calculates  the  first  k  nontrivial  eigenvectors  (j)  i,  0  2,  ■  ■  ■  ,4>k  (corresponding  to  the 
smallest  eigenvalues)  of  the  graph  Laplacian  L.  Let  U  G  M7Vxfc  be  the  matrix 

containing  the  vectors  4>i,  4*2, _ ,  4>k  as  columns.  For  i  G  {1,  2, . . . ,  N},  let  y.t  G 

be  the  ith  row  vector  of  U.  Spectral  clustering  then  applies  the  h-means  algorithm 
to  the  points  (yi){i=i,...,N}  and  partitions  them  into  k  groups,  where  k  is  the  number 
of  clusters  that  was  specified  beforehand. 

On  this  MNIST  “4-9”  digits  network,  we  specify  k  =  2  and  implement  spectral 
clustering  to  obtain  a  partition  into  two  communities.  As  we  show  in  Figure  4.4 
(b),  we  obtain  a  near-optimal  solution  that  splits  both  the  “4” -group  and  the 
“9” -group  roughly  in  half.  This  differs  markedly  from  the  ground-truth  partition 
in  panel  (a). 

For  the  Multi-h  MM  scheme,  we  use  Wig  =  80  and  the  search  range  n  G 
{2,3,...,  10}.  We  show  visualizations  of  the  partition  at  n  —  2  and  h  —  8  in  Figure 
4.4(c,d).  For  this  method,  computing  the  spectrum  of  the  graph  Laplacian  takes 
a  significant  portion  of  the  run  time  (9  seconds  for  this  data  set).  Importantly, 
however,  this  information  can  be  reused  for  multiple  n,  which  saves  time.  In 
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(c)  Modularity  MBO  Scheme  with  h=  2  (d)  Modularity  MBO  Scheme  with  n  =  8 


Figure  4.4:  (a)-(d)  Visualization  of  partitions  on  the  MNIST  “4-9”  digit  image 
network  by  projecting  it  onto  the  second  and  third  leading  eigenvectors  of  the 
graph  Laplacian.  Shading  indicates  the  community  assignment. 


Figure  4.5  (a),  we  show  a  plot  of  this  method’s  optimized  modularity  scores  versus 
n.  Observe  that  the  optimized  modularity  score  achieves  its  maximum  when  we 
choose  n  —  2,  which  yields  the  best  partition  that  we  obtain  using  this  method. 
In  Figure  4.5(b),  we  show  how  the  partition  evolves  as  we  increase  the  input  n 
from  2  to  10.  At  n  =  2,  the  network  is  partitioned  into  two  groups  (which  agrees 
very  well  with  the  ground  truth).  For  n  >  2,  however,  the  algorithm  starts  to  pick 
out  worse  local  optima,  and  either  “4” -group  or  the  “9” -group  gets  split  roughly 
in  half.  Starting  from  h  =  7,  the  number  of  communities  stabilizes  at  about  four 
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Input  n  for  n-class  partitioning 


(a)  Modularity  Score 


(b)  Group  Assignments  Visualization 


Figure  4.5:  Implementation  results  of  the  Multi-h  Modularity  MBO  scheme  on 
the  MNIST  “4-9”  digit  images.  In  panel  (b),  shading  indicates  the  community 
assignment.  The  horizontal  axis  represents  the  input  n  (i.e.,  the  maximum  number 
of  communities),  and  the  vertical  axis  gives  the  (sorted)  index  of  nodes.  In  panel 
(a),  we  plot  the  optimized  modularity  score  as  a  function  of  the  input  n. 


instead  of  increasing  with  h.  This  indicates  that  the  Modularity  MBO  scheme 
allows  one  to  obtain  partitions  with  Nc  <  n. 


Nc 

Q 

NMI 

Purity 

Time  (seconds) 

GenLouvain 

2 

0.9305 

0.85 

0.975 

110  s 

Modularity  MBO  (h  =  2) 

2 

0.9316 

0.85 

0.977 

11  s 

Multi- h  MM  (h  G  {2,  3, ... ,  10}) 

2 

0.9316 

0.85 

0.977 

25  s 

Spectral  Clustering  (h-Means) 

2 

NA 

0.003 

0.534 

1.5  s 

Table  4.2:  Computational  time  and  accuracy  comparison  on  MNIST  “4-9”  digits 
network. 


The  purity  score,  which  we  also  report  in  Table  4.2,  measures  the  extent  to 
which  a  network  partition  matches  ground  truth.  Suppose  that  an  fV-node  net¬ 
work  has  a  partition  C  =  {Ci,  C2,  ■  ■  ■ ,  Ck}  into  non-overlapping  communities 
and  that  the  ground-truth  partition  is  C  —  {Ci,  C2,  ■  ■  ■ ,  C^}.  The  purity  of  the 
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partition  C  is  then  defined  as 


prt (C,C)  =  ^^max{e{1)^}|C,fcnCi|  G  [0,1].  (4.21) 

Intuitively,  purity  can  by  viewed  as  the  fraction  of  nodes  that  have  been  assigned 
to  the  correct  community.  However,  the  purity  score  is  not  robust  in  estimating 
the  performance  of  a  partition.  When  the  partition  C  breaks  the  network  into 
communities  that  consist  of  single  nodes,  then  the  purity  score  achieves  a  value 
of  one.  hence,  one  needs  to  consider  other  diagnostics  when  interpreting  the 
purity  score.  In  this  particular  data  set,  a  high  purity  score  does  indicate  good 
performance  because  the  ground  truth  and  the  partitions  each  consist  of  two 
communities. 

Observe  in  Table  4.2  that  all  modularity-based  algorithms  identified  the  cor¬ 
rect  community  assignments  for  more  than  97%  of  the  nodes,  whereas  standard 
spectral  clustering  was  only  correct  for  just  over  half  of  the  nodes.  The  Multi-h 
MM  scheme  takes  only  25  seconds.  If  one  specifies  n  =  2,  then  the  Modularity 
MBO  scheme  only  takes  11  seconds. 

4. 6. 2. 2  MNIST  70k  network 

We  test  our  new  schemes  further  by  consider  the  entire  MNIST  network  of  70,000 
samples  containing  digits  from  “0”  to  “9” .  This  network  contains  about  five  times 
as  many  nodes  as  the  MNIST  “4-9”  network.  However,  the  node  strengths  in  the 
two  networks  are  very  similar  because  of  how  we  construct  the  weighted  adjacency 
matrix.  We  thus  choose  7  =  0.5  so  that  the  modularity  optimization  is  performed 
at  a  similar  resolution  scale  in  both  networks.  There  are  1001664  weighted  edges 
in  this  network. 

We  implement  the  Multi-n  MM  scheme  with  Wig  =  100  and  the  search  range 
n  G  {2,  3, . . . ,  20}.  Even  if  Nc  is  the  number  of  communities  in  the  true  optimal 
solution,  the  input  n  =  Nc  might  not  give  a  partition  with  Nc  groups.  The 
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modularity  landscape  in  real  networks  is  notorious  for  containing  a  huge  number 
of  nearly  degenerate  local  optima  (especially  for  values  of  modularity  Q  near 
the  globally  optimum  value)  [38],  so  we  expect  the  algorithm  to  yield  a  local 
minimum  solution  (such  as  the  merging  of  groups)  rather  than  a  global  minimum. 
Consequently,  it  is  preferable  to  extend  the  search  range  to  n  >  Nc,  so  that  the 
larger  n  gives  more  flexibility  to  the  algorithm  to  try  to  find  the  partition  that 
optimizes  modularity. 

The  best  partition  that  we  obtained  using  the  search  range  n  G  {2,  3, ... ,  20} 
contains  11  communities.  All  of  the  digit  groups  in  the  ground  truth  except  for 
the  “1” -group  are  correctly  matched  to  those  communities.  In  the  partition,  the 
“1” -group  splits  into  two  parts,  which  is  unsurprising  given  the  structure  of  the 
data.  In  particular,  the  samples  of  the  digit  “1”  include  numerous  examples  that 
are  written  like  a  “7” .  This  set  of  samples  are  thus  easily  disconnected  from  the 
rest  of  “1” -group.  If  one  considers  these  two  parts  as  one  community  associated 
with  “l”-group,  then  the  partition  achieves  a  96%  correctness  in  its  classification 
of  the  digits.  The  4%  error  rate  distributes  over  each  digit  group  almost  evenly. 

As  we  illustrate  in  Table  4.3,  the  GenLouvain  code  yields  comparably  successful 
partitions  as  those  that  we  obtained  using  the  Multi-h  MM  scheme.  By  comparing 
the  running  time  of  the  Multi-h  MM  scheme  on  both  MNIST  networks,  one  can  see 
that  our  algorithm  scales  well  in  terms  of  speed  when  the  network  size  increases. 
While  the  network  size  increases  five  times  (5x)  and  the  search  range  gets  doubled 
(2x),  the  computational  time  increases  by  a  factor  of  11.6  «  5  x  2. 

The  number  of  iterations  for  the  Modularity  MBO  scheme  ranges  approxi¬ 
mately  from  35  to  100  for  h  e  { 2,  3, ... ,  20}.  Empirically,  even  though  the  total 
number  of  iterations  can  be  as  large  as  over  a  hundred,  the  modularity  score 
quickly  gets  very  close  to  its  final  value  within  the  first  20  iteration. 

The  computational  cost  of  the  Multi-h  MM  scheme  consists  of  two  parts:  the 
calculation  of  the  eigenvectors  and  the  MBO  iteration  steps.  Because  of  the  size 
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of  the  MNIST  70k  network,  the  first  part  costs  about  90  seconds  in  Matlab. 
However,  one  can  incorporate  a  faster  eigenvector  solver,  such  as  the  Raylcigh- 
Chebyshev  (RC)  procedure  of  [3],  to  improve  the  computation  speed  of  an  eigen- 
decomposition.  This  solver  is  especially  fast  for  producing  a  small  portion  (in 
this  case,  1/700)  of  the  leading  eigenvectors  for  a  sparse  symmetric  matrix.  Upon 
implementing  the  RC  procedure  in  C++  code,  it  only  takes  12  seconds  to  compute 
the  100  leading  eigenvector-eigenvalue  pairs.  Once  the  eigenvectors  are  calculated, 
they  can  be  reused  in  the  MBO  steps  for  multiple  values  of  h  and  different  initial 
functions  /°.  This  allows  good  scalability,  which  is  a  particularly  nice  feature  of 
using  this  MBO  scheme. 


Nc 

Q 

NMI 

Purity 

Time  (second) 

GenLouvain 

11 

0.93 

0.916 

0.97 

10900  s 

Multi-n  MM  (n  G  {2,  3, ... ,  20}) 

11 

0.93 

0.893 

0.96 

290  s  /  212  s* 

Modularity  MBO  3%  GT  (h  =  10) 

10 

0.92 

0.95 

0.96 

94.5  s  /  16.5  s* 

*  Calculated  with  the  RC  procedure. 

Table  4.3:  Computational  time  and  accuracy  comparison  on  MNIST  70k  network. 


Another  benefit  of  the  Modularity  MBO  scheme  is  that  it  allows  the  possibility 
of  incorporating  a  small  portion  of  the  ground  truth  in  the  modularity  optimiza¬ 
tion  process.  In  this  work,  we  implement  the  Modularity  MBO  using  3%  of  the 
ground  truth  by  specifying  the  true  community  assignments  of  2100  nodes  in  the 
initial  function  /°;  we  chose  the  nodes  uniformly  at  random.  We  also  let  n  =  10. 
With  the  eigenvectors  already  computed  (which  took  12  seconds  using  the  RC 
process),  the  MBO  steps  take  a  subsequent  4.5  seconds  to  yield  a  partition  with 
exactly  10  communities  and  96.4%  of  the  nodes  classified  into  the  correct  groups. 
The  authors  of  Ref.  [35,  59]  also  implemented  a  segmentation  algorithm  on  this 
MNIST  70k  data  with  3%  of  the  ground  truth,  and  they  obtained  a  partition 
with  a  correctness  96.9%  in  15.4  seconds.  In  their  algorithm,  the  ground  truth 
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was  enforced  by  adding  a  quadratic  fidelity  term  to  the  energy  functional  (semi- 
supervised).  The  fidelity  term  is  the  L2  distance  of  the  unknown  function  /  and 
the  given  ground  truth.  In  our  scheme,  however,  it  is  only  used  in  the  initial  func¬ 
tion  f°.  Nevertheless,  it  is  also  possible  to  add  a  fidelity  term  to  the  Modularity 
MBO  scheme  and  thereby  perform  semi-supervised  clustering. 

4.6.3  Network-science  coauthorships 

Another  well-known  graph  in  the  community  detection  literature  is  the  network 
of  coauthorships  of  network  scientists.  This  benchmark  was  compiled  by  Mark 
Newman  and  first  used  in  [68]. 

In  this  work,  we  use  the  graph’s  largest  connected  component,  which  consists  of 
379  nodes  representing  authors  and  914  weighted  edges  that  indicate  coauthored 
papers.  We  do  not  have  any  so-called  ground  truth  for  this  network,  but  it  is  useful 
to  compare  partitions  obtained  from  our  algorithm  with  those  obtained  using  more 
established  algorithms.  In  this  section,  we  use  GenLouvain’s  result  as  this  pseudo¬ 
ground  truth.  In  addition  to  Modularity-MBO,  RMM,  and  GenLouvain,  we  also 
consider  the  results  of  modularity-based  spectral  partitioning  methods  that  allow 
the  option  of  either  bipartitioning  or  tripartitioning  at  each  recursive  stage  [68,79]. 

In  Ref.  [68],  Newman  proposes  a  spectral  partitioning  scheme  for  modular¬ 
ity  optimization  by  using  the  leading  eigenvectors  (associated  with  the  largest 
eigenvalues)  of  a  so-called  modularity  matrix  B  =  W  —  P  to  approximate  the 
modularity  function  Q.  In  the  modularity  matrix,  P  is  the  probability  null  model 
and  Pij  =  is  the  NG  null  model  with  7  =  1.  Assume  that  one  uses  the  first  p 
leading  eigenvectors  {ui,  u2, . . . ,  up},  and  let  j3j  denote  the  eigenvalue  of  u j  and 
U  =  (ui | U2 1  ■  •  •  | Up).  We  then  define  N  node  vectors  r.,  £  whose  jth  component 
is 

( ri)j  =  y/ flj  ~  aUij  , 
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where  a  <  (3P  and  j  G  {1,  2, . . .  ,p}.  The  modularity  Q  is  therefore  approximated 
as 

h 

Q^Q  =  Na  +  YJ  IIR/IIL,  (4-22) 

i=i 

where  R;  =  Ylg  =i  r*  is  sum  °f  &U  node  vectors  in  the  Zth  community  (where 
l  G  {1,  2, . . . ,  h}). 

A  partition  that  maximizes  (4.22)  in  a  given  step  must  satisfy  the  geometric 
constraints  R;  •  r*  >  0,  r/,:  =  /,  and  R;  •  R/j  <  0  for  all  l,h  G  {1,2,...,  n}.  Hence, 
if  one  constructs  an  approximation  Q  using  p  eigenvectors,  a  network  component 
can  be  split  into  at  most  p+  1  groups  in  a  given  recursive  step.  The  choice  p  —  2 
allows  either  bipartitioning  or  tripartitioning  in  each  recursive  step.  Reference  [68] 
discusses  the  case  of  general  p  but  reports  results  for  recursive  bipartitioning  with 
p  —  1.  Reference  [79]  implements  this  spectral  method  with  p  —  2  and  a  choice  of 
bipartitioning  or  tripartioning  at  each  recursive  step. 

In  Table  4.4,  we  report  diagnostics  for  partitions  obtained  by  several  algo¬ 
rithms  (for  7  =  1).  For  the  recursive  spectral  bipartitioning  and  tripartitioning, 
we  use  Matlab  code  provided  by  the  authors  of  Ref.  [79].  They  informed  us  that 
this  particular  implementation  was  not  optimized  for  speed,  so  we  expect  it  to  be 
slow.  One  can  create  much  faster  implementations  of  the  same  spectral  method. 
The  utility  of  this  method  for  the  present  comparison  is  that  Ref.  [79]  includes  a 
detailed  discussion  of  its  application  to  the  network  of  network  scientists.  Each 
partitioning  step  in  this  spectral  scheme  either  bipartitions  or  tripartitions  a  group 
of  nodes.  Moreover,  as  discussed  in  Ref.  [79],  a  single  step  of  the  spectral  triparti¬ 
tioning  is  by  itself  interesting.  Hence,  we  specify  n  —  3  for  the  Modularity  MBO 
scheme  as  a  comparison. 

From  Table  4.4,  we  see  that  the  Modularity  MBO  scheme  with  n  —  3  gives 
a  higher  modularity  than  a  single  tripartition,  and  the  former’s  NM1  and  purity 
are  both  significantly  higher.  When  we  do  not  specify  the  number  of  clusters, 
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Nc 

Q 

NMI 

Purity 

Time  (seconds) 

GenLouvain 

19 

0.8500 

1 

1 

0.5  s 

Spectral  Recursion 

39 

0.8032 

0.8935 

0.9525 

60  s 

RMM 

23 

0.8344 

0.9169 

0.9367 

0.8  s 

Tripartition 

3 

0.5928 

0.3993 

0.8470 

50  s 

Modularity  MBO 

3 

0.6165 

0.5430 

0.9974 

0.4  s 

Table  4.4:  Computational  time  and  accuracy  comparison  on  network-science  coau¬ 
thorships. 

the  RMM  scheme  achieves  a  higher  modularity  score  and  NMI  than  recursive 
bipartitioning/tripartitioning,  though  the  former’s  purity  is  lower  (which  is  not 
surprising  due  to  its  larger  Nc).  The  RMM  scheme  and  GenLouvain  have  similar 
run  times.  For  any  of  these  methods,  one  can  of  course  use  subsequent  post¬ 
processing,  such  as  Kernighan-Lin  node-swapping  steps  [68,76,79],  to  find  higher- 
modularity  partitions. 

4.6.4  A  note  on  computational  heuristics  and  time  complexity 

Numerous  computational  heuristics  have  been  employed  to  optimize  network  mod¬ 
ularity  [33,76].  We  have  compared  our  results  with  implementations  of  a  small 
number  of  popular  methods  that  others  have  made  available.  We  report  compu¬ 
tation  times  in  our  discussions.  Our  results  above  demonstrate  the  good  speed 
of  our  method,  but  we  have  not,  for  example,  included  a  comparison  of  its  speed 
with  Blondel  et  al.’s  C++  implementations  [10]  of  the  Louvain  method  [9].  Im¬ 
portantly,  the  low  computational  cost  of  our  method  is  a  direct  result  of  its  firm 
theoretical  grounding,  and  our  reformulation  of  the  problem  of  modularity  opti¬ 
mization  offers  hope  for  the  development  of  even  faster  methods  in  the  future. 
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4.7  Conclusion  and  discussion 


In  summary,  we  present  a  novel  perspective  on  the  problem  of  modularity  opti¬ 
mization  by  reformulating  it  as  a  minimization  of  an  energy  functional  involving 
the  total  variation  on  a  graph.  This  provides  an  interesting  bridge  between  the 
network  science  and  compressive  sensing  communities,  and  it  allows  the  use  of 
techniques  from  compressive  sensing  and  image  processing  to  tackle  modularity 
optimization.  We  propose  the  MBO  scheme  that  can  handle  large  data  at  very 
low  computational  cost.  Our  algorithms  produce  competitive  results  compared  to 
existing  methods,  and  they  scale  well  in  terms  of  speed  for  certain  networks  (such 
as  the  MNIST  data).  After  computing  the  eigenvectors  of  the  graph  Laplacian, 
the  time  complexity  of  each  MBO  iteration  step  is  O(N). 

One  major  part  of  our  schemes  is  to  calculate  the  leading  eigenvector-eigenvalue 
pairs,  so  one  can  benefit  from  the  fast  numerical  Rayleigh-Chebyshev  procedure 
in  Ref.  [3]  when  dealing  with  large,  sparse  networks.  Furthermore,  for  a  given  net¬ 
work  (which  is  represented  by  a  weighted  adjacency  matrix),  one  can  reuse  pre¬ 
viously  computed  eigen-decompositions  for  different  choices  of  initial  functions, 
different  values  of  n,  and  different  values  of  the  resolution  parameter  7.  This 
provides  welcome  flexibility,  and  it  can  be  used  to  significantly  reduce  computa¬ 
tion  time  because  the  MBO  step  is  extremely  fast,  as  each  step  is  O(N)  and  the 
number  of  iterations  is  empirically  small. 

Importantly,  our  reformulation  of  modularity  also  provides  the  possibility  to 
incorporate  partial  ground  truth.  This  can  be  accomplished  either  by  feeding 
the  information  into  the  initial  function  or  by  adding  a  fidelity  term  into  the 
functional.  (We  only  pursue  the  former  approach  in  this  work.)  It  is  not  obvious 
how  to  incorporate  partial  ground  truth  using  previous  optimization  methods. 
This  ability  to  use  our  method  either  for  unsupervised  or  for  semi- supervised 
clustering  is  a  significant  boon. 


72 


CHAPTER  5 


Graph  Mumford-Shah  in  Hyperspectral  Data 


In  this  chapter  we  focus  on  the  multi-class  segmentation  problem  using  the  piece- 
wise  constant  Mumford-Shah  model  (1.1)  in  a  graph  setting.  A  graph  variational 
method  is  developed  similarly  to  the  previous  chapter.  Application  of  the  pro¬ 
posed  method  on  the  problem  of  chemical  plume  detection  in  hyper-spectral  video 
data  is  presented  afterwards. 

Recall  the  definition  of  the  continuous  piecewise  constant  Mumford-Shah  model 
introduced  in  Chapter  1: 

n  n 

Ams($,  (cr}ti)  =  1*1  +  A  Y,  /  (“o  -  c,)2  ,  (5.1) 

r=\  J  ^ r 

where  a  given  contour  segments  an  image  region  hi  into  n  many  disjoint  sub- 
regions  fl  =  U”=1f2r,  u0  is  the  observed  imagery  data,  {cr}”=1  is  a  set  of  constant 
values,  and  |<L|  denotes  the  length  of  the  contour  <f>. 

To  study  the  hyper-spectral  data  via  a  graph,  each  node  (pixel)  nl  is  associated 
with  a  d-dimensional  feature  vector  (spectral  channels).  Let  uq  :  G  denote 

the  raw  hyper-spectral  data,  where  Mo('^i)  represents  the  d-dimensional  spectral 
channels  of  nt .  In  this  setting,  we  present  a  graph  version  of  the  multi-class 
piecewise  constant  Mumford-Shah  energy  functional: 

1  h 

MS(/,W:=i)  :=  2l/l7v  +  ^£(IK-c,||2,/,.),  (5.2) 

r— 1 

where  {cr}"=1  C  Md,  ||uo  —  cr ||2  denotes  an  JVx  1  vector 

(||wo(rii)  -  cr||2, . . . ,  ||«0(?Lv)  -  cr ||2) T  , 
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and  (||«o  —  cr||2,  fr)  =  YliLi  fr(ni)\\uo(ni)  —  cr||2.  Note  that  when  n*  and  rij  belong 
to  different  classes,  we  have  | ffnf)  —  =  2,  which  leads  to  the  coefficient  in 

front  of  the  term  \  \f\TV- 

To  see  the  connection  between  (5.2)  and  (5.1),  one  first  observes  that  fr  is  the 
characteristic  function  of  the  r-th  class,  and  thus  (||uo— cr||2,  fr)  is  analogous  to  the 
term  fa  (u o  —  cr )2  in  (5.1).  Furthermore,  the  total  variation  of  the  characteristic 
function  of  a  region  gives  the  length  of  its  boundary  contour,  and  therefore  \f\TV 
is  the  graph  analogy  of  |<F|. 

In  order  to  find  a  segmentation  for  G,  we  propose  to  solve  the  following  mini¬ 
mization  problem: 


min  MS(/,{cr}"=1). 

/eB,{Cr}™=1cRd 


(5.3) 


The  resulting  minimizer  /  yields  a  partition  of  G. 

One  can  observe  that  the  optimal  solution  of  (5.3)  must  satisfy: 

(u0,  fr) 


Cr  = 


Ef„  fM) ' 


(5.4) 


if  the  r-th  class  is  non-empty. 


Note  that  for  the  minimization  problem  given  in  (5.3),  it  is  essentially  equiv¬ 
alent  to  the  K-means  method  when  A  goes  to  +oo.  When  A  — >  0,  the  minimizer 
approaches  a  constant. 


5.1  Mumford-Shah  MBO  and  Lyapunov  functional 

The  authors  of  [62,  63]  introduce  an  efficient  algorithm  (known  as  the  MBO 
scheme)  to  approximate  motion  by  mean  curvature  of  an  interface  in  Euclidean 
space.  The  general  procedure  of  the  MBO  scheme  alternates  between  solving  a 
linear  heat  equation  and  thresholding.  One  interpretation  of  the  scheme  is  that 
it  replaces  the  non-linear  term  of  the  Allen-Cahn  equation  with  thresholding  [29]. 
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In  this  section  we  propose  a  variant  of  the  original  MBO  scheme  to  approximately 
find  the  minimizer  of  the  energy  MS(f,  (cr}”=1)  presented  in  (5.2).  Inspired  by 
the  work  of  [30,89],  we  write  out  a  Lyapunov  functional  YT(f)  for  our  algorithm 
and  prove  that  it  decreases  at  each  iteration  of  the  MBO  scheme. 


5.1.1  Mumford-Shah  MBO  scheme 

We  first  introduce  a  “diffuse  operator”  Tr  =  e-rL,  where  L  is  the  graph  Laplacian 
defined  above  and  r  is  a  time  step  size.  The  operator  Tr  is  analogous  to  the 
diffuse  operator  e-rA  of  the  heat  equation  in  PDE  (Euclidean  space).  It  satisfies 
the  following  properties. 

Proposition  2.  Given  Tt  =  e-rL,  firstly  Tr  is  strictly  positive  definite,  i.e. 
(/, rr/)  >  0  for  any  f  E  IK,  /  0.  Secondly,  Tr  conserves  the  mass,  i.e. 
(l,r Tf)  =  (1,/).  At  last,  the  quantity  A-(l  —  f,VTf)  approximates  \  |/|ry,  for 
any  f  E  B. 


Proof.  Taylor  expansion  gives 


—  T  L 


T 


=  I-T  L  +  — L  -  — +  . . . 


2! 


3! 


Suppose  v  is  an  eigenvector  of  L  associated  with  the  eigenvalue  £.  One  then  has 
Ttv  =  e~T^v  =>■  {v,Ttv)  =  e~T^(v,v)  >  0.  Let  the  eigen-decomposition  (with 
respect  to  L)  for  a  non-zero  /  :  G  — >  M  to  be  /  =  where  is  a 

set  of  orthogonal  eigenvectors  of  L  (note  that  L  is  positive  definite).  Because  Tr 
is  a  linear  operator,  one  therefore  has  (/,  VTf)  =  af (<j>i,  Tr0j)  >  0. 

For  the  second  property,  LI  =  0  =>■  (1,  LA  /)  =  0,  where  1  is  an  N-dimensional 
vector  with  one  at  each  entry.  Therefore,  the  Taylor  expansion  of  Tr  gives 

<i,r  r/>  =  (!,/>■ 

At  last,  rT  ~  I  -  tL  =*  A(1  _  fi  r  Tf)  ~  A<1  _/,/)_  1<!,  L /)  +  \{f,  L /). 
Particularly  when  f  E  B,  we  have  A.{1  —  f,  f)  =  |(1,L/)  =  0  and  |(/,  L/)  = 
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\  \  f\TV.  Hence  ^(1  -  /,  VTf)  approximates  \  \f\TV  in  B.  .0. 

An  intuitive  interpretation  of  the  third  property  of  Proposition  2  is  illustrated 
in  Figure  5.1.  For  an  indicator  function  /  =  shown  in  (a),  the  diffuse  operator 
propagates  the  nonzero  support  set  of  /  to  leak  outside  of  A  by  the  scale  of  r 
(time  step),  as  shown  in  (b).  The  inner  product  (1  —  f,Trf)  gives  the  shadowed 
margin  (c)  around  the  boundary,  which  is  an  approximation  of  the  perimeter  of 
set  A.  Note  that  the  operator  (/  +  tL)-1  also  satisfies  the  above  three  properties, 
and  can  serve  the  same  purpose  as  e-rL,  as  far  as  this  work  concerns. 


(a)/  (b)  TTf  (c)(l-/,rT/> 

Figure  5.1:  Illustration  of  the  third  property  of  Proposition  2:  the  quantity  A(i  _ 
/,  rr/)  approximates  \  \  f\TVi  f°r  any  /  €  ®. 

The  proposed  Mumford-Shah  MBO  scheme  for  the  minimization  problem  (5.3) 
consists  of  alternating  between  the  following  three  steps: 

For  a  given  fk  G  B  at  the  k-th  iteration  and  ck  =  yyyfy, 

1.  Compute 

/  =  r Tfk  ~  t\  (||tt0  -  Ci  ||2,  ||tt0  -  c|||2, . . . ,  ||tt0  -  4||2)  ,  (5.5) 

2.  (Thresholding) 

fk+1{rii)  =er,  r  =  argrna xc/c(n;) 

for  all  i  G  {1,  2, . . . ,  N},  where  er  is  the  r-th  standard  basis  in  W1,  i.e. 
fk+1[ni)  =  1  and  fk+1  G  B. 
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3.  (Update  c) 


k+i  K,  fr+l) 
(1  ,/*+1>  ' 


5.1.2  A  Lyapunov  functional 

We  introduce  a  Lyapunov  functional  YT  for  the  Mumford-Shah  MBO  scheme: 

2,  fr) ,  subject  to  cr  =  •  (5.6) 

According  to  the  third  property  of  Tr  in  Proposition  1,  the  energy  YT(f)  approx¬ 
imates  M5(/,{cr}"=1)  for  /  G  B  and  cr  =  •  A  similar  functional  for  the 

graph  total  variation  is  shown  and  discussed  in  [90]. 

Pursuing  similar  ideas  as  in  [30,89],  we  present  the  following  analysis  which 
consequently  shows  that  the  Mumford-Shah  MBO  scheme  (with  time  step  r)  de¬ 
creases  Yt  and  converges  to  a  stationary  state  within  a  finite  number  of  iterations. 

First  define 

1  " 

Gr(f,c)  :=  — <1  -  /,  Trf)  +  -  crf,  fr)  .  (5.7) 

"  r— 1 

Proposition  3.  The  functional  Gr(-,c)  is  strictly  concave  on  IK,  for  any  fixed 

{Cr}hr=l  e  Md. 

Proof.  Take  /, g  e  IK,  a  e  (0, 1).  We  have  (1  —  a)f  +  ag  e  IK,  because  IK  is  a 
convex  set. 


YtU)  ■■=  T(i  -  /. rr/>  +  a ^(|K  -  c. 

“  r= 1 


Gr  ((1  -  a)f  +  ag,  c)  -  (1  -  a)Gr(f ,  c)  -  aGr(g,  c ) 
=  ~a)(f-  g,VT(f  -  g))  >  0. 


(5.8) 


Equality  only  holds  when  f  —  g.  Therefore,  GT(-,  c )  is  strictly  concave  on  IK. 

□ 
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Aside  from  the  concavity  of  Gr,  we  observe  that  the  first  order  variation  of 
Gt(-,c)  is  given  as 

JjGr(f,c )  =  ^(l-2rr/)  +  A(||Mo-ci||2,||«o-c2||2,...)  . 

Note  that  since  (jjGr(fk,  ck),  f)  is  linear,  the  Step  2  (thresholding)  in  the 
Mumford-Shah  MBO  scheme  is  equivalent  to 

fk+1  :=  argmin feK(jjGT(fk,  ck ),  /) . 

Theorem  6.  In  the  Mumford-Shah  MBO  scheme,  the  Lyapunov  functional  Yr(fk+1) 
at  the  (. k  +  1  )-th  iteration  is  no  greater  than  Yr(fk).  Equality  only  holds  when 
fk  =  fk+1 .  Therefore,  the  scheme  achieves  a  stationary  point  in  IB  within  a  finite 
number  of  iterations. 

Proof. 

fi+1  :=  argmin/6K -{jjGAf,  <*),/)  (5.9) 

=>■  fk+l  G  B  (due  to  linearity)  and 

>  GT(fM,ck)-GT(fk,ck)  (concavity) 

=>•  GT(fk+1  ,ck)  <  Gr(fk,ck )  =  YT(fk).  Observe  that  ck+1  =  ('^k+-T^  is  the 

minimizer  of 

argmin{Cr}ft=i6EdGT(/fc+1,  c) 

=►  Gr{fk+\ck+l)  <  GT{fk+\ck)  <  Yr(fk). 

YT(fk+1)  <  YT(fk).  Therefore  the  Lyapunov  functional  Yr  is  decreasing  on  the 
iterations  of  the  Mumford-Shah  MBO  scheme,  unless  fk+1  =  fk.  Since  B  is  a 
finite  set,  a  stationary  point  can  be  achieved  in  a  finite  number  of  iterations. 

□ 


78 


Minimizing  the  Lyapunov  energy  Yr  is  an  approximation  of  the  minimization 
problem  in  (5.3),  and  the  proposed  MBO  scheme  is  proven  to  decrease  Yr.  There¬ 
fore,  we  expect  the  Mumford-Shah  MBO  scheme  to  approximately  solve  (5.3). 
In  Section  3.3  and  Section  3.4,  we  introduce  techniques  for  computing  the  MBO 
iterations  efficiently. 

5.1.3  Eigen-space  approximation 

To  solve  for  (5.5)  in  Step  1  of  the  Mumford-Shah  MBO  scheme,  one  needs  to 
compute  the  operator  Tr,  which  can  be  difficult  especially  for  large  datasets.  For 
the  purpose  of  efficiency,  we  numerically  solve  for  (5.5)  by  using  a  small  number  of 
the  leading  eigenvectors  of  L  (which  correspond  to  the  smallest  eigenvalues),  and 
project  fk  onto  the  eigen-space  spanned  from  the  eigenvectors.  By  approximating 
the  operator  L  with  the  leading  eigenvectors,  one  can  compute  (5.5)  efficiently. 
We  use  this  approximation  because  in  graph  clustering  methods,  researchers  have 
been  using  a  small  portion  of  the  leading  eigenvectors  of  a  graph  Laplacian  to 
extract  structural  information  of  the  graph. 

Let  l=i  denote  the  first  M  (orthogonal)  leading  eigenvectors  of  L.  and 

the  corresponding  eigenvalues.  Assume  fk  =  J2m=i  ^moY1,  where  am  is 
a  1  x  n  vector,  with  the  r-th  entry  a™  =  (fk,  (f>m).  Thus  /  can  be  approximately 
computed  as: 

M 

f  =  (IK  -  Ci ||2,  ||w0  -  C2II2, . . . ,  ||«o  -  c\ ||2)  .  (5.11) 

m= 1 

The  Mumford-Shah  MBO  algorithm  with  the  above  eigen-space  approximation 
is  summarized  in  Algorithm  1.  After  the  eigenvectors  are  obtained,  each  iteration 
of  the  MBO  scheme  is  of  time  complexity  O(N).  Empirically,  the  algorithm 
converges  after  a  small  number  of  iterations.  Note  that  the  iterations  stop  when 
a  purity  score  between  the  partitions  from  two  consecutive  iterations  is  greater 
than  99.9%.  The  purity  score,  as  used  in  [45]  and  previous  chapters,  measures 
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how  “similar”  two  partitions  are.  Intuitively,  it  can  be  viewed  as  the  fraction  of 
nodes  of  one  partition  that  have  been  assigned  to  the  correct  class  with  respect 
to  the  other  partition. 

Algorithm  2  Mumford-Shah  MBO  Algorithm 

Input:  /°,  M0,  {(0m,Cm}m= i,  A  A,  n,  k  =  0. 

while  (purity(/fe,  fk+1)  <  99.9%)  do 

.  _  _  i«oJb 
r  E2Li  fHm) ' 

•  <C  =  (fri<f>m)- 

•  f  =  Em=i  e~rU(pmam  -  r\  (||m0  -  ci||2,  ||w0  -  c2||2, . . . ,  ||w0  -  cA||2). 

•  /fc+1(nj)  =  er,  where  r  =  argma xc/c(nj). 

•  k  k  +  1. 

end  while 


5.1.4  Nystrom  method 

The  Nystrom  extension  [34]  is  a  matrix  completion  method  which  has  been  used  to 
efficiently  compute  a  small  portion  of  the  eigenvectors  of  the  graph  Laplacian  for 
segmentation  problems  [8,35,59,60].  In  our  proposed  scheme,  leading  eigenvectors 
of  L  are  needed,  which  can  require  massive  computational  time  and  memory.  For 
large  graphs  such  as  the  ones  induced  from  images,  the  explicit  form  of  the  weight 
matrix  W  and  therefore  L  is  difficult  to  obtain  (0(A2)  time  complexity).  Hence, 
we  expect  to  use  the  Nystrom  method  to  approximately  compute  the  eigenvectors 
for  our  algorithm. 

Given  a  similarity  matrix  W  to  be  computed,  the  Nystrom  method  only  ran¬ 
domly  samples  a  very  small  number  (M)  of  rows  of  W,  M  «  N .  After  reordering 
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the  rows,  one  can  assume  the  first  M  rows  are  the  sampled  ones: 

A  B 

W  = 

bt  c 

i.e.  the  M  x  M  matrix  A  and  the  M  x  (N  —  M)  matrix  B  are  computed  and 
known.  The  only  unknown  part  of  W  is  the  (N  —  M )  x  (N  —  M )  matrix  C.  The 
core  idea  of  the  Nystrom  method  is  to  approximate  C  by  BTA~1B.  Based  on 
matrix  completion  and  properties  of  eigenvectors,  ones  can  approximately  obtain 
M  eigenvectors  of  the  symmetric  normalized  graph  Laplacian  Bsym  without  ex¬ 
plicitly  computing  C .  Detailed  descriptions  of  the  Nystrom  method  can  be  found 
in  [8,60], 

Note  that  our  previous  analysis  only  applies  to  L  rather  than  L sym,  and  the 
Nystrom  method  can  not  be  trivially  formularized  for  L.  Therefore  this  ques¬ 
tion  remains  to  be  studied.  However,  the  normalized  Laplacian  hsym  has  many 
similar  features  compared  to  L,  and  it  has  been  used  in  place  of  L  in  many  seg¬ 
mentation  problems.  In  the  numerical  results  shown  below,  the  eigenvectors  of 
L Sym  computed  via  Nystrom  perform  well  empirically.  One  can  also  implement 
other  efficient  methods  to  compute  the  eigenvectors  for  the  Mumford-Shah  MBO 
algorithm. 

5.2  Numerical  Results 

The  hyper-spectral  images  tested  in  this  work  are  taken  from  the  video  recording 
of  the  release  of  chemical  plumes  at  the  Dugway  Proving  Ground,  captured  by 
long  wave  infrared  (LWIR)  spectrometers.  The  data  is  provided  by  the  Applied 
Physics  Laboratory  at  Johns  Hopkins  University.  A  detailed  description  of  this 
dataset  can  be  found  in  [18].  We  take  seven  frames  from  a  plume  video  sequence 
in  which  each  frame  is  composed  of  128  x  320  pixels.  We  use  a  background  frame 
and  the  frames  numbered  72  through  77  containing  the  plume.  Each  pixel  has 


81 


(a)  2nd  Eigenvector 


(b)  3rd  Eigenvector 


(c)  4th  Eigenvector 


(d)  5th  Eigenvector 


Figure  5.2:  The  leading  eigenvectors  of  the  normalized  graph  Laplacian  computed 
via  the  Nystrom  method. 

129  spectral  channels  corresponding  to  a  particular  frequency  in  the  EM  spectrum 
ranging  from  7,820  nm  to  11,700  nm.  Thus,  the  graph  we  construct  from  these 
seven  frames  is  of  size  7  x  128  x  320  with  each  node  n*  corresponding  to  a  pixel 
with  a  129-dimensional  spectral  signature  V{.  The  metric  for  computing  the  weight 
matrix  is  given  as: 

(l  _  (vj,Vj)  \2 

r  '  IMIIKII '  ^ 

wtj  =  exp{ - — - }  , 

where  a  =  0.01  is  chosen  empirically  for  the  best  performance.  Note  that  in  this 
experiment  a  is  a  robust  parameter  which  gives  decent  results  for  a  wide  range  of 
values  (0.001  <  cr  <  10). 

The  goal  is  to  segment  the  image  and  identify  the  “plume  cloud”  from  the 
background  components  (sky,  mountain, grass),  without  any  ground  truth.  As 
described  in  the  previous  section,  M  =  100  eigenvectors  of  the  normalized  graph 
Laplacian  (LSJ/m)  are  computed  via  the  Nystrom  method.  The  computational 
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Figure  5.3:  The  segmentation  results  obtained  by  the  Mumford-Shah  MBO 
scheme,  on  a  background  frame  plus  the  frames  72-77.  Shown  in  (a)  and  (b)  are 
segmentation  outcomes  obtained  with  different  initializations.  The  visualization 
of  the  segmentations  only  includes  the  first  four  frames. 


time  using  Nystrom  is  less  than  a  minute  on  a  2.8GHz  machine  with  Intel  Core 
2Duo.  The  visualization  of  the  first  five  eigenvectors  (associated  with  the  smallest 
eigenvalues)  are  given  in  Figure  5.2  for  the  first  four  frames,  (the  first  eigenvector 
is  not  shown  because  it  is  close  to  a  constant  vector). 

We  implement  the  Mumford-Shah  MBO  scheme  using  the  eigenvectors  on  this 
seven  frames  of  plume  images,  with  r  =  0.15,  A  =  150  and  n  —  5.  The  test  is  run 
for  20  times  with  different  uniformly  random  initialization,  and  the  segmentation 
results  are  shown  in  Figure  5.3.  Note  that  depending  on  the  initialization,  the 
algorithm  can  converge  to  different  local  minimum,  which  is  common  for  most 
non-convex  variational  methods.  The  result  in  (a)  occurred  five  times  among  the 
20  runs,  and  (b)  for  twice.  The  outcomes  of  other  runs  merge  either  the  upper 
or  the  lower  part  of  the  plume  with  the  background.  The  segmentation  outcome 
shown  in  (a)  gives  higher  energy  than  that  in  (b).  Among  the  20  runs,  the  lowest 
energy  is  achieved  by  a  segmentation  similar  to  (a),  but  with  the  lower  part  of  the 
plume  merged  with  the  background.  It  may  suggest  that  the  global  minimum  of 
the  proposed  energy  does  not  necessarily  give  a  desired  segmentation. 
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Figure  5.4:  Energy  MS(f )  (blue,  solid  line)  and  YT(f)  (red,  dash  line)  at  each 
iteration  from  the  same  test  as  shown  in  Figure  5.3  (a). 

Notice  that  in  Figure  5.3  (b),  even  though  there  actually  exist  five  classes, 
only  four  major  classes  can  be  perceived,  while  the  other  one  contains  only  a  very 
small  amount  of  pixels.  By  allowing  n  =  5  instead  of  h  —  4,  it  helps  to  reduce 
the  influence  of  a  few  abnormal  pixels.  The  computational  time  for  each  iteration 
is  about  2-3  seconds  on  a  1.7GHz  machine  with  Intel  Core  i5.  The  number  of 
iterations  is  around  20-40. 

Figure  5.4  demonstrates  a  plot  of  the  MS(f )  and  YT(f)  energies  at  each  itera¬ 
tion  from  the  same  test  as  the  one  shown  in  Figure  5.3  (a).  The  Lyapunov  energy 
Yr(f  )  (red,  dash  line)  is  non-increasing,  as  proven  in  Theorem  1.  Note  that  all 
the  energies  are  computed  approximately  using  eigenvectors. 

As  a  comparison,  the  segmentation  results  using  K-means  and  spectral  clus¬ 
tering  are  shown  in  Figure  5.5.  The  K-means  method  is  performed  directly  on 
the  raw  image  data  (7  x  128  x  320  by  129).  As  shown  in  (a)  and  (b),  the  re¬ 
sults  obtained  by  K-means  fail  to  capture  the  plume;  the  segmentations  on  the 
background  are  also  very  fuzzy.  For  the  spectral  clustering  method,  a  four-way 
(or  five-way)  K-means  is  implemented  on  the  four  (or  Eve)  leading  eigenvectors 
of  the  normalized  graph  Laplacian  (computed  via  Nystrom).  As  shown  in  (c) 
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(a)  4-way  K-means 


(b)  5-way  K-means 


(c)  Spectral  Clustering  with  4-way  K-means 


(d)  Spectral  Clustering  with  5-way  K-means 


Figure  5.5:  K-means  and  spectral  clustering  segmentation  results.  The  visualiza¬ 
tion  of  the  segmentations  only  includes  the  first  four  frames. 
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(b) 


Figure  5.6:  The  segmentation  results  obtained  by  the  Mumford-Shah  MBO 
scheme,  on  a  background  frame  plus  the  frames  67-72.  Shown  in  (a)  and  (b) 
are  segmentation  outcomes  obtained  with  different  initializations. 

and  (d),  the  resulting  segmentations  divide  the  sky  region  into  two  undesirable 
components.  Unlike  the  segmentation  in  Figure  5.3  (a)  where  the  mountain  com¬ 
ponent  (red,  the  third  in  the  background)  has  a  well  defined  outline,  the  spectral 
clustering  results  do  not  provide  clear  boundaries.  Our  approach  performs  better 
than  other  unsupervised  clustering  results  on  this  dataset  [36,84], 

Another  example  of  the  plume  data  is  show  in  Figure  5.6,  where  the  67th  to 
72nd  frames  (instead  of  the  72nd  to  77th)  are  taken  along  with  the  background 
frame  as  the  test  data.  The  test  is  run  20  times  using  different  uniformly  random 
initialization,  where  r  =  0.15,  A  =  150  and  n  =  5.  The  result  in  Figure  5.6  (a) 
occurred  11  times  among  the  20  runs,  and  (b)  for  five  times.  The  outcomes  from 
the  other  four  runs  segment  the  background  into  three  components  as  in  (a),  but 
merge  the  plume  with  the  center  component.  The  segmentation  result  shown  in 
(a)  gives  the  lowest  energy  among  all  the  outcomes.  The  visualization  includes  all 
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seven  frames  since  the  plume  is  small  in  the  first  several  frames. 

5.3  Conclusion 

In  this  chapter  we  present  a  graph  framework  for  the  multi-class  piecewise  con¬ 
stant  Mumford-Shah  model  using  a  simplex  constrained  representation.  Based 
on  the  graph  model,  we  propose  an  efficient  threshold  dynamics  algorithm,  the 
Mumford-Shah  MBO  scheme  for  solving  the  minimization  problem.  Theoretical 
analysis  is  developed  to  show  that  the  MBO  iteration  decreases  a  Lyapunov  energy 
that  approximates  the  MS  functional.  Furthermore,  in  order  to  reduce  the  com¬ 
putational  cost  for  large  datasets,  we  incorporate  the  Nystrom  extension  method 
to  approximately  compute  a  small  portion  of  the  eigenvectors  of  the  normalized 
graph  Laplacian,  which  does  not  require  computing  the  whole  weight  matrix  of 
the  graph.  After  obtaining  the  eigenvectors,  each  iteration  of  the  Mumford-Shah 
MBO  scheme  is  of  time  complexity  0(n).  The  number  of  iterations  for  conver¬ 
gence  is  small  empirically. 

The  proposed  method  can  be  applied  to  general  high- dimensional  data  seg¬ 
mentation  problems.  In  this  work  we  focus  on  the  segmentation  of  hyper-spectral 
video  data.  Numerical  experiments  are  performed  on  a  collection  of  hyper-spectral 
images  taken  from  a  video  for  plume  detection;  using  our  proposed  method,  com¬ 
petitive  results  are  achieved.  However,  there  are  still  open  questions  to  be  an¬ 
swered.  For  example,  the  Nystrom  method  can  only  compute  eigenvectors  for 
the  normalized  Laplacian,  while  the  theoretical  analysis  for  the  Lyapunov  func¬ 
tional  only  applies  to  the  un-normalized  graph  Laplacian.  This  issue  remains  to 
be  studied.  Note  that  the  graph  constructed  in  this  work  does  not  include  the 
spacial  information  of  the  pixels,  but  only  the  spectral  information.  One  can  cer¬ 
tainly  build  a  graph  incorporating  the  location  of  each  pixel  as  well,  to  generate 
a  non-local  means  graph  as  discussed  in  [8]. 
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CHAPTER  6 


An  Incremental  Reseeding  Strategy 


This  chapter  presents  the  collaborative  work  in  [13],  where  we  propose  a  resampling- 
based  spectral  algorithm  for  multiway  graph  partitioning.  The  algorithm  proceeds 
by  alternating  three  simple  routines  in  an  iterative  fashion:  diffusion,  thresholding, 
and  random  sampling.  On  graphs  that  contain  reasonably  well-balanced  clusters 
of  medium  scale,  the  algorithm  provides  a  strong  combination  of  accuracy,  effi¬ 
ciency  and  robustness  to  noise  in  the  graph  construction  process.  The  algorithm 
is  also  exceedingly  simple,  intuitive  and  trivial  to  implement.  It  also  parallelizes 
trivially,  and  can  therefore  scale  gracefully  to  large  numbers  of  clusters  as  well  as 
to  graphs  with  large  numbers  of  vertices. 

We  validate  these  claims  via  an  extensive  experimental  evaluation  of  the  algo¬ 
rithm.  We  exhaustively  test  our  algorithm  on  a  total  of  26  real  world  data  sets. 
We  also  provide  a  detailed  algorithmic  comparison  using  four  recent  clustering 
algorithms  that  claim  state-of-the-art  results.  These  experiments  demonstrate 
that  our  algorithm  achieves  state-of-the-art  performance  in  terms  of  cluster  pu¬ 
rity  while  running  an  order  of  magnitude  faster  than  the  other  highly  accurate 
clustering  methods  (e.g.  [97])  that  we  compare  against.  Moreover,  our  strategy 
performs  well  using  only  random  initialization.  This  contrasts  with  other  state- 
of-the-art  methods  that  rely  on  a  lower-level  algorithm  (such  as  Normalized  Cuts) 
for  initialization.  We  also  provide  experiments  to  demonstrate  the  robustness  of 
the  algorithm  with  respect  to  noise  and  perturbations  in  the  underlying  graph. 
While  many  highly  accurate  algorithms  exhibit  a  sharp  decrease  in  accuracy  if 


Figure  6.1:  Illustration  of  the  Incremental  Reseeding  (INGRES)  Algorithm  for 
R  =  3  clusters.  The  various  colors  red,  blue,  and  green  identify  the  clusters,  (a): 
At  this  stage  of  the  algorithm,  s  =  2  seeds  are  randomly  planted  in  the  clusters 
computed  from  the  previous  iteration,  (b):  The  seeds  grow  with  the  random  walk 
operator,  (c):  A  new  partition  of  the  graph  is  obtained  and  used  to  plant  s  +  ds 
seeds  into  these  clusters  at  the  next  iteration. 

the  input  graph  is  corrupted  by  noise,  our  algorithm  remains  stable:  the  accu¬ 
racy  of  our  algorithm  decays  slowly  and  gracefully  with  increasing  levels  of  noise. 
These  results,  when  taken  together,  lead  to  an  algorithm  with  a  quite  appealing 
combination  of  simplicity,  performance  and  ease  in  out-of-the-box  usage. 

6.1  Description  of  the  Algorithm 

The  main  idea  behind  our  algorithm  arises  from  a  well-known  and  widely  used 
property  of  the  random  walk  on  a  graph.  Specifically,  a  random  walker  started  in 
a  low  conductance  cluster  is  unlikely  to  leave  that  cluster  quickly  [56].  This  fact 
provides  the  basis  for  transductive  label  propagation  methods  [101]  as  well  as  for 
“local”  clustering  methods  [83].  In  label  propagation,  for  instance,  an  oracle  pro¬ 
vides  a  set  of  labeled  vertices  that  are  propagated  along  the  graph  using  a  random 
walk  matrix  or  a  diffusion  matrix.  Each  unlabeled  vertex  is  then  associated  to  the 
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label  which,  after  being  propagated,  best  represents  the  given  unlabeled  vertex. 

Our  algorithm  simply  iterates  upon  this  basic  idea.  Assume  that  the  graph 
has  R  well-defined  clusters  of  comparable  size  and  low  conductance.  If  we  knew 
these  clusters  in  advance,  we  could  then  select  a  handful  of  “seed”  vertices  in 
the  center  of  each  cluster.  We  would  then  expect  to  obtain  good  results  from  a 
transductive  label  propagation  by  using  these  seeds  as  labels.  In  an  unsupervised 
context  we  cannot,  of  course,  a-priori  place  seeds  in  the  center  of  each  cluster.  To 
overcome  this,  we  instead  place  a  handful  of  seeds  at  random.  We  then  apply  a 
random  walk  matrix  or  diffusion  matrix  a  few  times  to  propagate  these  seeds.  We 
finally  obtain  a  temporary  clustering  by  assigning  each  vertex  to  the  seed  which, 
after  propagation,  best  represents  the  vertex.  We  then  choose  new  seeds  from 
these  temporary  clusters  and  iterate  the  process.  If  the  clusters  improve  then 
the  seeds  will  likely  improve,  and  vice-versa.  This  incites  a  feed-back  loop  and 
we  get  a  virtuous  cycle.  We  can  then  excite  the  speed  and  improve  the  quality 
of  this  cycle  by  gradually  drawing  more  and  more  seeds  throughout  the  process. 
We  refer  to  this  idea  as  an  incremental  reseeding  strategy.  Figure  6.1  depicts  this 
cyclic  process  graphically. 

6.1.1  Basic  algorithm 

To  formalize  these  ideas,  recall  that  (G,  E)  denotes  a  weighted,  connected  graph 
on  N  vertices  G  =  {ni, . . . ,  n^}  with  edge  weights  E  =  {uy,}^=1  that  encode  a 
measure  of  similarity  between  each  pair  (rq,  rij)  of  vertices.  The  algorithm  starts 
from  a  random  partition  V  =  (Ci, . . .  ,Cr),  C\  U  •  •  •  U  Cr  —  G,  Cr  fl  Cq  —  0 
(r  ^  q)  of  the  vertices.  In  other  words,  each  nt  is  assigned  to  one  of  the  R  clusters 
uniformly  at  random.  Let  s  —  1  denote  the  initial  number  of  seeds.  At  each  of  the 
successive  iteration,  we  update  the  current  partition  V  =  (Ci, . . . ,  Cr)  according 
to  the  steps  outlined  in  the  pseudocode  Algorithm  3  for  INGRES.  The  pseudocode 
for  the  INGRES  Subroutines  is  presented  in  Algorithm  4. 
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Algorithm  3  INGRES  Algorithm 

Input:  Similarity  matrix  W,  seed  increment  ds,  number  of  clusters  R. 

Initialization:  s  —  1,  random  partition  V. 

repeat 

F  =  PLANT('P,  s) 

F  <-  GROW (F,  W) 

V  <-  HARVEST(F) 
s  i —  s  +  ds 
until  V  converges 
Output:  V 

At  the  beginning  of  each  iteration,  the  routine  PLANT('P,s)  will  sample  s 
seeds  from  each  of  the  R  clusters  Cr  in  the  current  partition  V  uniformly  at  ran¬ 
dom.  These  Rs  seeds  furnish  the  temporary  labels  that  the  GROW  routine  then 
propagates  along  the  graph  using  a  random  walk  matrix.  We  initialize  GROW 
with  an  N  x  R  matrix  F,  where  F^r  =  1  if  vertex  Vi  was  drawn  from  cluster  Cr 
and  Fi  r  =  0  otherwise.  We  then  iteratively  apply  the  random  walk  transition 
matrix  WD  1  to  F  until  each  vertex  has  a  nonzero  probability  of  being  visited  by 
a  random  walker,  i.e.  until  each  entry  Fir  is  nonzero.  Finally,  the  routine  HAR¬ 
VEST  simply  assigns  each  vertex  vt  to  its  most  likely  cluster,  or  in  other  words  the 
cluster  for  which  Ft  r  is  maximal.  This  produces  a  new  partition  V  of  the  vertices 
into  R  clusters.  We  then  increment  s  to  s  +  ds,  and  use  this  partition  and  number 
of  seeds  s  to  initialize  PLANT  at  the  beginning  of  the  next  iteration.  We  refer  to 
this  overall  procedure  as  the  Incremental  Reseeding  Algorithm  (INGRES). 

The  overall  routine  has  only  a  single  parameter  ds  that  controls  the  linear  rate 
at  which  the  number  of  seeds  drawn  at  each  iteration  increases.  In  practice,  we 
select 

ds  =  speed  x  10-4  x  —  (6.1) 

R 

for  some  proportionality  constant  speed  between  one  and  ten.  By  rescaling  ds 
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Algorithm  4  INGRES  Subroutines 
function  PLANT^s) 

Initialize  F  as  an  N-by-R  matrix  of  zeros. 

for  r  =  1  to  R  do 

for  k  =  1  to  s  do 

Draw  at  random  a  vertex  v%  in  cluster  Cr. 

TP.  * —  TP  4-1 
i,r  ~  L  i,r  i  -1- 

end  for 
end  for 
return  F. 
end  function 

function  GROW(F,  W) 
while  min,  minr  Fi  r  =  0  do 
F  4r-  (WD^1)  F 

end  while 
return  F. 
end  function 

function  HARVEST (F) 
for  r  =  1  to  R  do 

Cr  =  {i  :  Fi  r  >  Fi,.s  for  all  s} 

end  for 

return  V  =  (Ci, . . .  ,CR) 

end  function 
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in  this  way,  a  constant  of  proportionality  speed=l  corresponds  to  a  total  of 
s  =  0.1  N/R  seeds  planted  in  each  cluster  after  1000  iterations.  Assuming  well- 
balanced  clusters  of  roughly  equal  size,  approximately  one-tenth  of  each  cluster 
is  sampled  after  1000  iterations.  Drawing  a  significant  fraction  of  each  cluster 
will  cause  the  subsequent  clustering  to  stabilize,  leading  to  eventual  convergence 
of  the  algorithm.  Using  around  10%  of  labels  in  each  cluster  is  a  typical  level  at 
which  INGRES  will  stabilize. 

The  parameter  ds  therefore  represents  a  “timestep”  for  INGRES,  and  the 
overall  algorithm  behaves  well  with  respect  to  this  parameter.  In  general,  small 
increments  ds  will  lead  to  slower  convergence  at  higher  accuracy  while  larger  in¬ 
crements  ds  will  lead  to  faster  convergence  but  potentially  less  accurate  solutions. 
Our  experiments  show  that  speed  =  1  works  remarkably  well  for  a  large  variety 
of  data  sets.  We  also  provide  results  for  speed=5,  which  yields  faster  stabiliza¬ 
tion  with  slightly  less  accuracy,  to  show  that  the  algorithm  is  indeed  robust  and 
predictable  with  respect  to  the  choice  of  this  parameter. 

The  general  INGRES  framework  also  proves  robust  to  implementation  choices 
for  the  three  main  routines.  For  instance,  in  addition  to  the  random  walk  matrix 
WD_1,  there  exists  a  variety  of  alternative  means  to  propagate  labels  along  a 
graph.  By-and-large,  the  overall  INGRES  strategy  does  not  depend  heavily  upon 
the  particular  implementation  of  GROW,  so  long  as  it  realizes  the  basic  idea  of 
label  propagation  in  one  form  or  another.  For  instance,  we  have  found  that  re¬ 
placing  the  random  walk  step  F  •<—  WD_1F  with  a  diffusion  step  F  •<—  D_1WF 
or  F  t—  D_1/2WD_1/2.F  will  give  similar  results  in  many  circumstances.  Occa¬ 
sionally,  we  have  found  that  utilizing  a  “personalized  Page-Rank”  step 

F  t—  aWD_1f  +  (1  —  a)F0  (6.2) 

can  give  better  performance  on  small  data  sets  that  contain  a  large  (relative  to  the 
size  of  the  data  set)  number  of  clusters.  Here  the  parameter  0  <  a  <  1  denotes  a 
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length-scale  that  controls  the  extent  of  diffusion  and  Fq  denotes  the  input  to  the 
GROW  routine.  A  propagation  step  of  the  form  (6.2)  is  also  used  in  Pagerank- 
N1BBLE  [2]  and  NMFR  [97],  up  to  replacing  WD  1  with  D_1/2WD-1/2  in  the 
latter  case.  As  another  example,  choosing  to  sample  with  or  without  replacement 
in  the  PLANT  routine  leads  to  essentially  no  significant  difference  in  the  resultant 
clusterings. 

6.1.2  Relation  with  other  work 

Our  methodology  relies  upon  and  incorporates  number  of  ideas  from  transductive 
learning.  In  particular,  we  leverage  the  notion  of  label  propagation  [101].  In 
the  standard  label  propagation  framework,  an  oracle  provides  a  set  of  labeled 
points  or  vertices.  These  labeled  points  form  either  nonzero  initial  conditions 
or  heat  sources  for  a  discrete  heat  equation  on  the  graph.  The  second  step  of 
the  INGRES  algorithm  (the  GROW  routine)  precisely  corresponds  with  a  label 
propagation  of  the  random  labels  returned  from  the  first  step  of  the  algorithm 
(the  PLANT  routine). 

The  NIBBLE  algorithm  and  its  relatives  [2,56,82,83]  use  a  similar  idea  to  get 
an  unsupervised  clustering  method  from  label  propagation  by  planting  random 
seeds.  These  works  cluster  the  entire  graph  in  a  sequential  manner:  at  each  step 
a  single  random  vertex  is  drawn  and  propagated.  Then  a  sweep  is  performed 
to  extract  a  small  cluster  around  this  vertex.  These  algorithms  function  well  for 
problems  aimed  at  extracting  many  small  clusters  from  graphs  with  fine  structure. 
In  contrast,  we  perform  multiway  partitioning  directly  instead  of  recursively.  We 
also  aim  at  medium  scale  clusters  instead  of  small  scale  clusters.  We  also  utilize 
a  significantly  different  random  seeding  strategy. 

The  INGRES  algorithm  also  alternates  between  label  propagation  (GROW) 
and  thresholding  (HARVEST).  The  idea  of  iteratively  alternating  between  a  few 
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steps  of  label  propagation  and  subsequent  thresholding  has  also  appeared  in  a 
transductive  learning  context  [35,59],  although  the  presence  of  labeled  information 
results  in  a  different  implementation  of  the  propagation  step.  The  non-negative 
matrix  factorization  method  [97]  also  incorporates  random  walk  information  in  a 
manner  that  resembles  the  GROW  routine,  but  otherwise  the  underlying  princi¬ 
ples  of  the  algorithms  differ  substantially. 

Because  the  GROW  function  we  use  iterates  the  random  walk  on  the  graph,  our 
algorithm  is  a  form  of  spectral  clustering.  However,  our  main  contribution  to  the 
clustering  problem,  and  the  primary  novelty  in  our  algorithm,  is  the  incremental 
reseeding  process.  This  process  is  not  fundamentally  tied  to  the  INGRES  algorithm 
presented  here  —  it  seems  to  be  quite  universal  and  can  be  adapted  to  other 
clustering  methods.  However,  combining  reseeding  with  the  random  walk  method 
offers  an  excellent  combination  of  accuracy,  speed,  and  robustness. 

6.2  Experiments 

We  now  provide  the  results  of  our  extensive  experimental  evaluation  of  the  al¬ 
gorithm.  This  section  shows  that  our  algorithm  achieves  state-of-the-art  perfor¬ 
mance  in  terms  of  cluster  purity  on  a  variety  of  real  word  data  sets  while  running 
an  order  of  magnitude  faster  than  the  other  comparably  accurate  clustering  meth¬ 
ods.  Moreover,  whereas  some  of  the  competing  algorithms  are  initialized  with  a 
partition  provided  by  a  low  cost  algorithm  such  as  NCut,  the  INCRES  algorithm 
is  initialized  with  a  random  partition.  Finally  we  show  that  INCRES  is  very 
robust  to  perturbations  in  the  input  graph. 

The  Algorithms:  We  compare  our  method  against  four  clustering  algorithms 
that  rely  on  variety  of  different  principles.  We  select  algorithms  that,  like  our 
algorithm,  partition  the  graph  in  a  direct,  non-recursive  manner.  The  NCut  algo¬ 
rithm  [98]  is  a  widely  used  spectral  algorithm  that  relies  on  a  post-processing  of  the 
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Table  6.1:  Algorithmic  Comparison  via  Cluster  Purity. 


Data 

size 

R 

RND 

NCut 

LSD 

NMFR 

MTV 

INCRES 

( speed  1 ) 

INCRES 

(speed  5) 

20NEWS 

20K 

20 

6.3% 

26.6% 

34.3% 

60.7% 

35.8% 

61.0% 

60.7% 

CADE 

21K 

3 

15.5% 

41.0% 

41.3% 

52.0% 

44.2% 

52.8% 

52.1% 

RCV1 

9.6K 

4 

30.3% 

38.2% 

38.1% 

42.7% 

42.8% 

54.5% 

51.2% 

WEBKB4 

4.2K 

4 

39.1% 

39.8% 

45.8% 

58.1% 

45.2% 

57.1% 

56.8% 

CITESEER 

3.3K 

6 

21.8% 

23.4% 

53.4% 

62.6% 

42.6% 

62.0% 

62.2% 

MNIST 

70K 

10 

11.3% 

76.9% 

75.5% 

97.1% 

95.5% 

96.0% 

94.0% 

PENDIGIT 

11K 

10 

11.6% 

80.2% 

86.1% 

86.8% 

86.5% 

88.8% 

85.9% 

USPS 

9.3K 

10 

16.7% 

71.5% 

70.4% 

86.4% 

85.3% 

87.8% 

87.4% 

OPTDIGIT 

5.6K 

10 

12.0% 

90.8% 

91.0% 

98.0% 

95.2% 

97.4% 

94.8% 

eigenvectors  of  the  graph  Laplacian  to  optimize  the  normalized  cut  energy.  The 
NMFR  algorithm  [97]  uses  non-negative  matrix  factorization  and  graph-based 
random  walk  principles  in  order  to  factorize  and  regularize  the  original  input 
similarity  matrix.  The  LSD  algorithm  [4]  provides  another  non-negative  matrix 
factorization  algorithm.  It  aims  at  finding  a  left-stochastic  decomposition  of  the 
similarity  matrix.  The  MTV  algorithm  from  [15]  provides  a  total-variation  based 
algorithm  that  attempts  to  find  an  optimal  multiway  Cheeger  cut  of  the  graph  by 
using  L\  optimization  techniques.  The  last  three  algorithms  (NMFR,  LSD  and 
MTV)  all  use  NCut  in  order  to  obtain  an  initial  partition.  By  contrast,  we  initial¬ 
ize  our  algorithm  with  a  random  partition.  We  use  the  code  available  from  [98] 
for  NCut,  the  code  available  from  [97]  to  test  the  two  non-negative  matrix  fac¬ 
torization  algorithms  (NMFR  and  LSD)  and  the  code  available  from  [15]  for  the 
MTV  algorithm. 

The  Data  Sets:  We  provide  experimental  results  on  five  text  data  sets 
(20NEWS,  CADE,  RCV1,  WEBKB4,  CITESEER)  and  four  data  sets  contain- 
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ing  images  of  handwritten  digits  (MNIST,  PENDIGITS,  USPS,  OPTDIGITS). 
We  processed  the  text  data  sets  by  removing  a  list  of  stop  words  as  well  as  by 
removing  all  words  with  fewer  than  twenty  occurrences  (for  20NEWS)  and  fewer 
than  five  occurrences  (for  all  others)  across  the  corpus.  We  then  construct  a  5-NN 
graph  based  on  the  cosine  similarity  between  tf-idf  features.  For  variety,  we  in¬ 
clude  some  weighted  graphs  (RCV1  and  CITESEER)  as  well  as  some  unweighted 
graphs  (20NEWS,  CADE  and  WEBKB4).  For  MNIST,  PENDIGITS  and  OPT¬ 
DIGITS  we  use  the  similarity  matrices  constructed  by  [97],  where  the  authors 
first  extract  scattering  features  [19]  for  images  before  calculating  an  unweighted 
10-NN  graph.  For  USPS  we  constructed  a  weighted  10-NN  graph  from  the  raw 
data  without  any  preprocessing.  The  source  for  these  data  sets  and  more  details 
on  their  construction  are  provided  in  the  appendix  of  [13]. 


(a)  20NEWS 


(b)  MNIST 


Figure  6.2:  Purity  Distributions 


6.2.1  Accuracy  comparisons 

In  Table  6.1  we  report  the  accuracy  obtained  by  the  selected  algorithms  LSD, 
NMFR,  MTV  and  INCRES  (for  two  values  of  the  timestep  parameter,  speed  =  1 
and  speed  =  5)  on  the  various  data  sets.  Cluster  purity,  as  mentioned  in  previ- 
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ous  chapters,  is  used  to  quantify  the  quality  of  the  calculated  partition,  defined 
according  to  the  relation 


Purity 


number  of  “successes” 
N 


1 

N 


R 


£ 


max  nri. 
1<  i<R  ' 


Here  nr:i  denotes  the  number  of  data  points  in  the  rth  cluster  that  belong  to  the 
jth  grounr]_truth  class.  In  other  words,  given  a  computed  cluster  we  count  a  data 
point  as  a  success  if  it  belongs  to  the  ground  truth  class  that  best  represents 
the  cluster.  We  allowed  each  iterative  algorithm  a  total  of  10,000  iterations  to 
reach  convergence.  Both  INGRES  and  MTV  rely  on  randomization,  so  for  these 
algorithm  we  report  the  average  purity  achieved  over  1000  different  runs.  The 
fourth  column  of  the  table  (RND)  provides  a  base-line  purity  for  reference,  i.e. 
the  purity  obtained  by  assigning  each  data  point  to  a  class  from  1  to  R  uniformly 
at  random.  The  boldface  numbers  in  the  table  indicate  the  highest  purity  score 
achieved  on  each  data  set. 


Overall,  INGRES  and  NMFR  significantly  outperform  the  other  algorithms. 
This  is  especially  true  for  text  data  sets.  Both  algorithms  utilize  a  random  walk 
strategy  to  help  “smooth”  irregular  graphs,  such  as  the  similarity  matrices  ob¬ 
tained  from  text  data  sets.  This  strategy  also  contributes  to  the  robustness  of 
these  algorithms  and  to  their  solid  performance  across  the  full  range  of  data  sets. 
However,  the  INGRES  algorithm  typically  runs  at  least  one  order  of  magnitude 
faster  than  the  NMFR  algorithm.  Due  to  the  similarity  of  their  results,  we  provide 
a  more  exhaustive  comparison  between  these  two  algorithms  in  the  supplementary 
material  on  17  additional  data  sets. 


Finally,  note  that  the  INGRES  algorithm  performs  comparably  when  speed  = 
1  and  speed  =  5,  demonstrating  that  the  algorithm  is  robust  with  respect  to  the 
choice  of  the  seed  increment  parameter  ds.  The  INGRES  algorithm  also  performs 
rather  consistently  across  independent  runs  of  the  random  process.  Figure  6.2 
provides  an  experimental  validation  of  this  fact.  Specifically,  this  figure  displays  a 
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histogram  of  the  purity  values  obtained  over  1008  independent  trials  of  INGRES  on 
both  20NEWS  and  MNIST  (using  speed  =  1  for  this  example).  On  20NEWS,  all 
independent  trials  of  the  INGRES  algorithm  converge  to  a  solution  with  a  purity 
value  between  58%  and  64%.  The  INGRES  algorithm  converges  to  one  of  two 
attractors  on  MNIST,  one  at  roughly  88%  purity  and  one  at  roughly  97%  purity, 
with  the  vast  majority  of  all  trials  converging  to  the  higher-accuracy  attractor. 

6.2.2  Speed  comparisons 

Figure  6.3  illustrates  the  speed  at  which  the  iterative  algorithms  (LSD,  MTV, 
NMFR  and  INGRES)  converge  toward  their  respective  solutions.  We  ran  each 
algorithm  for  a  total  of  7  minutes  on  20NEWS  and  for  15  minutes  on  MNIST.  We 
report  the  purity  obtained  by  the  algorithm  at  each  iteration.  For  the  randomized 
algorithm  (INGRES  and  MTV)  the  purity  curves  were  obtained  by  averaging  the 
results  over  240  runs.  The  overwhelming  computational  burden  for  all  of  these 
algorithms  arises  from  the  sparse- matrix  times  full-matrix  multiplications  required 
at  each  step.  Each  algorithm  is  implemented  in  a  fair  and  consistent  way,  and  the 
experiments  were  all  performed  on  the  same  architecture. 


Table  6.2:  Computational  Time 


Data 

INCRES  INCRES 

NMFR  MTV 

( speedl )  ( speed5 ) 

20NEWS 

3.7mn  -  25.4s  5.6s 

(57.7%)  -  (57.7%)  (58%) 

MNIST 

4.6mn  1.8mn  4.8s  3.1s 

(92.2%)  (90.7%)  (91.2%)  (89.3%) 

In  order  to  give  an  indication  of  the  speed/accuracy  trade-off  for  each  algo¬ 
rithm,  in  Table  6.2  we  record  the  time  it  took  for  the  purity  obtained  by  each  algo¬ 
rithm  to  reach  95%  of  its  limiting  value  on  both  20NEWS  and  on  MNIST.  Overall, 
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(a)  20NEWS  (first  90sec) 


(b)  20NEWS  (first  7mn) 


(c)  MNIST  (first  30sec)  (d)  MNIST  (first  15mn) 


Figure  6.3:  Purity  curves  for  the  four  algorithms  considered  on  two  benchmark 
data  sets  (20NEWS  and  MNIST).  We  plot  purity  against  time  for  each  algorithm 
over  two  different  time  windows.  The  circular  marks  on  each  curve  indicate  the 
point  at  which  the  curve  reaches  95%  of  its  limiting  value.  The  corresponding 
times  at  which  this  happens  are  reported  in  Table  6.2. 


the  simple  INCRES  algorithm  provides  accuracy  comparable  to  the  state-of-the- 
art  NMF  algorithm  [97],  yet  runs  an  order  of  magnitude  faster.  Timing  results  on 
the  data  sets  from  table  6.1  are  consistent  with  those  obtained  for  20NEWS  and 
MNIST,  in  the  sense  that  INCRES  typically  runs  one  order  of  magnitude  faster 
than  NMFR  on  these  data  sets  as  well. 
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Table  6.3:  Robustness  Comparisons 


Noise 

NCut 

LSD 

MTV 

INCRES 

( speedl ) 

20NEWS 

+0%  edges 

27% 

34% 

36% 

61% 

+50%  edges 

21% 

27% 

20% 

52% 

+100%  edges 

18% 

22% 

11% 

44% 

+150%  edges 

15% 

20% 

10% 

34% 

+200%  edges 

14% 

18% 

9% 

27% 

MNIST 

+0%  edges 

77% 

76% 

96% 

96% 

+50%  edges 

87% 

94% 

55% 

97% 

+100%  edges 

84% 

93% 

25% 

97% 

+150%  edges 

74% 

87% 

18% 

97% 

+200%  edges 

67% 

82% 

16% 

96% 

6.2.3  Robustness  experiments 

Table  6.3  reports  accuracy  results  of  various  algorithms  on  graphs  that  we  cor¬ 
rupted  by  adding  different  levels  of  noise.  We  began  with  the  original  20NEWS 
graph  used  in  Table  6.1  and  added  additional  edges  to  the  graph  uniformly  at 
random.  The  original  graph  had  e  =  144,  632  edges.  For  the  experiment,  we 
added  0.5e,  e,  1.5e  and  2e  additional  noise  edges.  For  each  of  these  four  levels  of 
noise,  we  randomly  generated  144  separate  perturbed  graphs.  The  table  reports, 
for  each  level  of  noise,  the  average  purity  obtained  by  each  algorithm  on  the  144 
randomly  generated  matrices.  We  then  proceeded  to  perturb  the  original  MNIST 
graph  in  a  similar  fashion.  The  original  graph  has  e  =  1,027,412  edges,  and  we 
randomly  generated  120  graphs  at  each  level  of  noise.  This  gives  a  total  of  1056 
randomly  generated  graphs  for  this  set  of  experiments.  We  provide  experimental 
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results  for  all  algorithms  other  than  NMFR,  which  simply  takes  far  too  much 
time  to  run  to  convergence  on  all  1056  adjacency  matrices.  Nevertheless,  given 
the  similarity  in  their  performace,  we  expect  that  NMFR  would  perform  similarly 
to  INGRES  on  this  set  of  experiments  as  well. 

The  results  clearly  elucidate  the  robustness  of  the  INGRES  algorithm  with 
respect  to  noise  in  the  graph  construction  process.  On  the  20NEWS  data  set,  for 
example,  all  other  algorithms  experience  a  sharp  decrease  in  accuracy  as  soon  as 
noise  is  added.  In  contrast,  the  purity  of  the  INGRES  algorithm  slowly  decreases 
in  a  stable  fashion.  On  the  MNIST  data  sets,  the  results  obtained  by  INGRES 
remain  essentially  unchanged  across  all  noise  levels.  The  competing  algorithms 
do  not  exhibit  this  behavior.  Interestingly,  NCut  and  LSD  actually  obtain  better 
results  at  the  50%  and  100%  noise  levels.  Given  that  LSD  relies  on  NCut  for 
initialization,  it  comes  as  no  surprise  that  gains  for  NCut  produce  subsequent  gains 
for  LSD  as  well.  This  pathological  behavior  still  indicates  a  lack  of  robustness,  in 
the  sense  that  both  algorithms  exhibit  a  high  degree  of  sensitivity  to  changes  in 
the  underlying  graph. 
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